This page describes how the HerpesChance risk calculators were built, validated, and deployed. All calculators estimate statistical likelihood of seropositivity and are not diagnostic tests.
All models were trained on publicly available data from the CDC National Health and Nutrition Examination Survey (NHANES). NHANES is a nationally representative cross-sectional survey that combines interviews, physical examinations, and laboratory tests on approximately 5,000 US residents per two-year cycle.
| Disease | NHANES Cycles | Years | Participants | Serology Target |
|---|---|---|---|---|
| HSV-1 | 6 cycles (D–I) | 2005–2016 | 21,498 | LBXHE1 (IgG antibody) |
| HSV-2 | 6 cycles (D–I) | 2005–2016 | 17,622 | LBXHE2 (IgG antibody) |
| HCV | 6 cycles (D–I) | 2005–2016 | 44,370 | LBDHCV / LBXHCR (antibody + RNA) |
| HBV | 6 cycles (D–I) | 2005–2016 | 44,534 | LBXHBC (core antibody) |
| HPV (16, 6, HR) | 2 cycles (E–F) | 2007–2010 | 7,663 | LBX06 / LBX16 / LBX18 (Luminex) |
NHANES data files were merged on the unique respondent identifier (SEQN). Demographics were inner-joined with laboratory serology results; questionnaire modules (sexual behavior, drug use, alcohol use, health insurance) were left-joined to preserve sample size when questionnaire data was missing.
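The merge strategy can be sketched with pandas on hypothetical mini-frames (real NHANES files are SAS transport `.XPT` files, typically loaded with `pd.read_sas`):

```python
import pandas as pd

# Hypothetical mini-frames standing in for NHANES demographics, serology,
# and questionnaire files; column names follow the NHANES codes in the text.
demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 52, 19]})
lab = pd.DataFrame({"SEQN": [1, 2], "LBXHE2": [1, 0]})
quest = pd.DataFrame({"SEQN": [1], "SXD171": [4]})

# Inner join with serology: keep only respondents who have a lab result.
df = demo.merge(lab, on="SEQN", how="inner")

# Left join questionnaire modules: preserve sample size when answers are missing.
df = df.merge(quest, on="SEQN", how="left")
```

Respondent 3 drops out at the inner join (no serology), while respondent 2 survives the left join with a missing questionnaire value.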
Several NHANES variable codes changed across cycles and were harmonized during processing.
Variables were selected through a two-stage process. First, univariate t-tests (continuous variables) and chi-squared tests (categorical variables) were run against HSV serostatus to identify significantly associated NHANES variables with low attrition. Variables passing the significance threshold were then entered into a multivariate logistic regression, and those retaining significance in the joint model were included in the final feature set.
For the v2 models, an automated three-stage feature selection pipeline was used.
This pipeline selected 20–25 features per disease from an initial candidate pool of approximately 50 NHANES variables.
Categorical variables were one-hot encoded. Race/ethnicity was expanded into five binary indicator columns (with "Mexican-American" as the implicit reference category for v1 models). Income brackets were similarly encoded with "<$20K" as the reference. Binary variables (yes/no) were coded as 1/0.
Missing values were imputed using the population median of each feature from the training data. Median imputation was chosen for robustness to outliers and skewed distributions. The imputer is fitted on training data and applied identically at inference time, ensuring no data leakage.
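A minimal sketch of this step with scikit-learn's `SimpleImputer`, on a toy matrix whose first column contains an outlier that would distort a mean but leaves the median untouched:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training matrix; 100.0 is an outlier in the first column.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [100.0, 30.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                            # medians learned from training data only
X_new = imputer.transform([[np.nan, np.nan]])   # applied identically at inference
```

The fitted medians are 3.0 and 20.0, so the fully missing inference row becomes `[3.0, 20.0]`; fitting only on training data is what prevents leakage.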
Logistic regression models (HSV-1) were preceded by standardization (zero mean, unit variance) to ensure stable coefficient estimation. Gradient boosting models (all others) were not scaled, as tree-based algorithms are invariant to monotonic transformations of input features.
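A sketch of the HSV-1 stack on synthetic data: impute, standardize, then logistic regression, chained in one scikit-learn `Pipeline` (the tree-based models simply omit the scaler step):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration only.
rng = np.random.RandomState(0)
X = rng.randn(40, 3)
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),                 # zero mean, unit variance
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)
probs = pipe.predict_proba(X)[:, 1]
```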
Several features were derived from raw NHANES variables during preprocessing:
For each disease target, candidate algorithms (including logistic regression and gradient boosting) were compared under 5-fold stratified cross-validation.
The algorithm with the highest cross-validated AUC was selected for each disease. Final models were then retrained on the full dataset.
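A hypothetical version of this selection loop on synthetic data: score each candidate by mean 5-fold AUC, then retrain the winner on the full dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for an NHANES feature matrix.
X, y = make_classification(n_samples=400, weights=[0.8], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "gb": GradientBoostingClassifier(random_state=42),
}
scores = {name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
          for name, model in candidates.items()}

best = max(scores, key=scores.get)
final_model = candidates[best].fit(X, y)   # retrain on the full dataset
```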
The logistic regression model estimates the probability of seropositivity as:
P(Y = 1 | x) = 1 / (1 + exp(−(β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ)))
where x is the vector of input features and β are learned coefficients. The model was fit with L2 regularization (default penalty strength C = 1.0) and a maximum of 1,000 solver iterations. Logistic regression was selected for HSV-1 because it matched gradient boosting performance (AUC 0.750 vs. 0.751) with a simpler, more interpretable model.
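The equation above can be verified directly on synthetic data: `predict_proba` is exactly the logistic transform of the linear score, using the stated settings (L2 penalty, C = 1.0, 1,000 iterations).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration; not the NHANES features.
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

z = model.intercept_ + X @ model.coef_.ravel()   # beta_0 + beta . x
manual = 1.0 / (1.0 + np.exp(-z))                # sigmoid, as in the equation
sklearn_probs = model.predict_proba(X)[:, 1]
```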
Gradient boosting builds an additive ensemble of shallow decision trees, where each tree corrects the residual errors of the previous ensemble. At each iteration m, a new tree hₘ is fit to the negative gradient of the log-loss, and the ensemble is updated as:

Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)
where η is the learning rate. Two hyperparameter configurations were used depending on class balance:
| Parameter | Standard (HSV-2) | Tuned (HCV, HBV, HPV) |
|---|---|---|
| Number of trees | 200 | 300 |
| Max depth | 4 | 3 |
| Learning rate (η) | 0.1 | 0.05 |
| Subsample ratio | 0.8 | 0.7 |
| Min samples per leaf | 1 (default) | max(10, n_pos / 20) |
The tuned configuration uses a lower learning rate, shallower trees, and a minimum leaf size proportional to the positive class count. This prevents the model from creating pure leaf nodes that memorize rare positive cases — critical for HCV, where only 1.45% of samples are positive.
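The tuned configuration translates into scikit-learn parameters as follows; the helper function and random seed are illustrative assumptions, but the hyperparameter values come from the table above.

```python
from sklearn.ensemble import GradientBoostingClassifier

def tuned_gb(n_positive: int) -> GradientBoostingClassifier:
    """Tuned configuration from the table above; the adaptive minimum
    leaf size scales with the positive-class count. The random seed is
    an assumption, not stated in the text."""
    return GradientBoostingClassifier(
        n_estimators=300,
        max_depth=3,
        learning_rate=0.05,
        subsample=0.7,
        min_samples_leaf=max(10, n_positive // 20),
        random_state=42,
    )

hcv_model = tuned_gb(n_positive=641)  # HCV: 641 positive cases in training
```

With 641 positives, the minimum leaf size resolves to 32, so no leaf can be built from fewer than 32 samples.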
Raw classifier outputs do not always correspond to true event frequencies. To correct this, calibration was applied using isotonic regression via scikit-learn's CalibratedClassifierCV with 5-fold internal cross-validation. Isotonic regression fits a non-decreasing piecewise-constant function that maps raw scores to observed positive rates:
P_calibrated = f_isotonic(P_raw), where f(a) ≤ f(b) for all a ≤ b
Isotonic calibration was used for the HSV-2 model (applied during training) and all three HPV models (applied during a post-training recalibration step after the original models were found to overestimate probabilities).
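A sketch of the training-time setup (as used for HSV-2) on synthetic data: wrap the base classifier in isotonic calibration with 5-fold internal cross-validation.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic, roughly imbalanced target for illustration only.
rng = np.random.RandomState(42)
X = rng.randn(500, 5)
y = (X[:, 0] + rng.randn(500) > 1.0).astype(int)

calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=42),
    method="isotonic",
    cv=5,
).fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]
```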
All models were evaluated using 5-fold stratified cross-validation (random seed = 42, shuffle enabled). Stratification ensures that each fold preserves the class ratio of the full dataset, which is particularly important for imbalanced targets like HCV (1.45% positive).
Two primary metrics guided model selection and evaluation: ROC AUC, which measures how well the model ranks positive cases above negative ones, and the Brier score, the mean squared error between predicted probabilities and observed outcomes:

BS = (1/N) ∑ᵢ₌₁ᴺ (p̂ᵢ − yᵢ)²
For highly imbalanced targets (HCV, HBV), F1 score and Average Precision (area under the precision-recall curve) were also monitored to ensure adequate sensitivity to the minority class.
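A worked example of these metrics on hypothetical predictions for a rare outcome: AUC measures ranking quality, the Brier score measures probability accuracy, and average precision summarizes the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

# Hypothetical labels and predicted probabilities (2 positives out of 10).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
p_hat = np.array([0.02, 0.05, 0.01, 0.10, 0.03, 0.04, 0.02, 0.20, 0.70, 0.40])

auc = roc_auc_score(y_true, p_hat)        # 1.0: every positive outranks every negative
brier = brier_score_loss(y_true, p_hat)   # (1/N) * sum((p_hat - y)^2)
ap = average_precision_score(y_true, p_hat)
```

Here the ranking is perfect (AUC = AP = 1.0), yet the Brier score is nonzero because the positives get probabilities of only 0.70 and 0.40; this is exactly the gap that calibration targets.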
| Model | Algorithm | Samples | Prevalence | Features | CV AUC | CV Brier |
|---|---|---|---|---|---|---|
| HSV-1 | Logistic Regression | 21,498 | 56.5% | 16 | 0.750 ± 0.007 | 0.200 ± 0.003 |
| HSV-1 (no sexual history) | Logistic Regression | 21,498 | 56.5% | 15 | 0.748 ± 0.007 | 0.201 ± 0.003 |
| HSV-2 | Calibrated Gradient Boosting | 17,622 | 18.7% | 20 | 0.809 ± 0.007 | 0.121 ± 0.001 |
| HCV | Gradient Boosting (tuned) | 44,370 | 1.45% | 25 | 0.901 ± 0.013 | 0.011 ± 0.001 |
| HBV | Gradient Boosting (tuned) | 44,534 | 5.38% | 25 | 0.859 ± 0.006 | 0.043 ± 0.000 |
| HPV-16 | Gradient Boosting (recalibrated) | 7,663 | 13.6% | 20 | 0.743 ± 0.013 | 0.107 ± 0.002 |
| HPV-6 | Gradient Boosting (recalibrated) | 7,663 | 19.5% | 20 | 0.726 ± 0.018 | 0.141 ± 0.003 |
| HPV High-Risk (16/18) | Gradient Boosting (recalibrated) | 7,663 | 16.3% | 20 | 0.748 ± 0.015 | 0.122 ± 0.002 |
All metrics are mean ± standard deviation from 5-fold stratified cross-validation (seed = 42).
| Variable | HSV-1 | HSV-2 | NHANES Source |
|---|---|---|---|
| Age | ✓ | ✓ | RIDAGEYR |
| Biological sex | ✓ | ✓ | RIAGENDR |
| Race/ethnicity (5 indicators) | ✓ | ✓ | RIDRETH1 / RIDRETH3 |
| Household income (4 indicators) | ✓ | ✓ | INDHHIN2 |
| College graduate | ✓ | ✓ | DMDEDUC2 |
| Age at first sexual contact | ✓ | ✓ | SXD031 |
| Lifetime sexual partners | ✓ | ✓ | SXD171 / SXD510 |
| Household size | ✓ | | DMDHHSIZ |
| Born in USA | ✓ | | DMDBORN4 |
| Health insurance | | ✓ | HIQ011 |
| Military service | | ✓ | DMQMILIZ |
| 12+ alcoholic drinks (lifetime) | | ✓ | ALQ101 |
| Ever used marijuana | | ✓ | DUQ200 |
| Ever used hard drugs (derived) | | ✓ | DUQ240 / DUQ370 |
| In long-term relationship (derived) | | ✓ | DMDMARTL |
The v2 models use an expanded feature set selected by the automated pipeline described above, including additional disease-specific risk factors.
Disease prevalence in the training data ranges from 1.45% (HCV) to 56.5% (HSV-1). For rare outcomes, naive models tend to predict low probabilities for all observations and fail to identify true positives. Several strategies were employed to address this:
| Disease | Prevalence | Positive Cases | Imbalance Strategy |
|---|---|---|---|
| HSV-1 | 56.5% | 12,162 | None needed (balanced) |
| HSV-2 | 18.7% | 3,303 | Isotonic calibration |
| HPV types | 13.6–19.5% | 1,043–1,497 | Tuned GB + post-hoc recalibration |
| HBV | 5.38% | 2,398 | Tuned GB + adaptive min leaf size |
| HCV | 1.45% | 641 | Tuned GB + adaptive min leaf size + F1/AP monitoring |
The adaptive minimum leaf size (min_samples_leaf = max(10, n_positive / 20)) prevents the gradient boosting trees from splitting all the way down to pure leaf nodes, which would overfit to spurious patterns in the minority class.
Each trained model is serialized as a scikit-learn Pipeline object using joblib. The pipeline encapsulates all preprocessing stages (imputation, optional scaling) and the classifier into a single artifact, ensuring that the exact transformations applied during training are reproduced at inference time.
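A sketch of the serialization round-trip on synthetic data; the filename is hypothetical:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Fit a small preprocessing + classifier pipeline on synthetic data.
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Dump the whole pipeline as one artifact and reload it unchanged.
path = os.path.join(tempfile.mkdtemp(), "hsv1_model.joblib")
joblib.dump(pipe, path)
restored = joblib.load(path)
```

Because the imputer travels inside the artifact, the reloaded pipeline reproduces the training-time transformations exactly.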
At inference, user inputs are encoded into a numeric feature vector matching the training schema. The probability of seropositivity is obtained as:
p̂ = pipeline.predict_proba(x)[0, 1]
The resulting probability is displayed as a percentage and visualized as a gauge chart. Percentile rank is computed by comparing the user's predicted probability against the distribution of predictions for the full training population.
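The percentile-rank step can be sketched with hypothetical numbers: the user's predicted probability is compared against the distribution of predictions for the training population.

```python
import numpy as np

# Hypothetical predicted probabilities for the training population.
train_probs = np.array([0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80])
user_prob = 0.40  # p-hat from pipeline.predict_proba for this user

# Fraction of the population whose predicted risk is below the user's.
percentile = (train_probs < user_prob).mean() * 100
```

Here 4 of the 7 population predictions fall below 0.40, so the user sits at roughly the 57th percentile.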
These calculators are educational tools. For symptoms, exposure concerns, or diagnosis, consult a licensed clinician about appropriate confirmatory laboratory testing.
Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey (NHANES). Available at: www.cdc.gov/nchs/nhanes.
The specific NHANES data files and serology variables used are listed in the tables above.
All NHANES data is publicly available, de-identified, and contains no personally identifiable information. No IRB approval was required for this secondary analysis of public-use data.