Methodology

This page describes how the HerpesChance risk calculators were built, validated, and deployed. All calculators estimate the statistical likelihood of seropositivity and are not diagnostic tests.

1. Data Foundation

All models were trained on publicly available data from the CDC National Health and Nutrition Examination Survey (NHANES). NHANES is a nationally representative cross-sectional survey that combines interviews, physical examinations, and laboratory tests on approximately 5,000 US residents per two-year cycle.

Survey Cycles Used

| Disease | NHANES Cycles | Years | Participants | Serology Target |
|---|---|---|---|---|
| HSV-1 | 6 cycles (D–I) | 2005–2016 | 21,498 | LBXHE1 (IgG antibody) |
| HSV-2 | 6 cycles (D–I) | 2005–2016 | 17,622 | LBXHE2 (IgG antibody) |
| HCV | 6 cycles (D–I) | 2005–2016 | 44,370 | LBDHCV / LBXHCR (antibody + RNA) |
| HBV | 6 cycles (D–I) | 2005–2016 | 44,534 | LBXHBC (core antibody) |
| HPV (16, 6, HR) | 2 cycles (E–F) | 2007–2010 | 7,663 | LBX06 / LBX16 / LBX18 (Luminex) |

NHANES data files were merged on the unique respondent identifier (SEQN). Demographics were inner-joined with laboratory serology results; questionnaire modules (sexual behavior, drug use, alcohol use, health insurance) were left-joined to preserve sample size when questionnaire data was missing.
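The merge strategy above can be sketched with pandas. The frames and values here are toy stand-ins for the NHANES modules (the real pipeline reads the published data files); only the join logic is the point:

```python
import pandas as pd

# Toy frames standing in for NHANES modules (SEQN is the respondent key).
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [34, 51, 27, 45]})
labs = pd.DataFrame({"SEQN": [1, 2, 4], "LBXHE2": [1, 0, 1]})
sxq = pd.DataFrame({"SEQN": [1, 4], "SXD171": [5, 12]})

# Inner-join demographics with serology: keep only respondents with lab results.
df = demo.merge(labs, on="SEQN", how="inner")

# Left-join the questionnaire module: the sample is preserved even where
# questionnaire answers are missing (those cells become NaN).
df = df.merge(sxq, on="SEQN", how="left")
```

Respondent 3 drops out at the inner join (no serology), while respondent 2 survives the left join with a missing questionnaire value.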

Variable Harmonization

Several NHANES variable codes changed across cycles. These were harmonized during processing:

2. Feature Selection

HSV Models (v1)

Variables were selected through a two-stage process. First, univariate t-tests (for continuous variables) and chi-squared tests (for categorical variables) were run against HSV serostatus to identify NHANES variables that were significantly associated with the outcome and had low rates of missing responses. Variables passing the significance threshold were then entered into a multivariate logistic regression, and those retaining significance in the joint model were included in the final feature set.
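A minimal sketch of the two-stage screen on synthetic stand-in data (the variable names, effect sizes, and 0.05 threshold are illustrative, not the production configuration):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                      # synthetic serostatus
cols = {
    "age": 30 + 5 * y + rng.normal(0, 8, n),   # shifts with serostatus
    "noise": rng.normal(0, 1, n),              # unrelated filler variable
}

# Stage 1: univariate t-test per continuous variable.
keep = [name for name, x in cols.items()
        if stats.ttest_ind(x[y == 1], x[y == 0]).pvalue < 0.05]

# Stage 2: joint logistic regression over the stage-1 survivors;
# coefficients retaining significance would enter the final feature set.
X = np.column_stack([cols[k] for k in keep])
model = LogisticRegression(max_iter=1000).fit(X, y)
```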

HCV, HBV, and HPV Models (v2)

An automated three-stage feature selection pipeline was used:

  1. Mutual Information screening: All candidate features were scored against the target using mutual information. Features with near-zero MI were eliminated.
  2. Statistical significance filter: Remaining features were tested for univariate association with the target. Only those meeting significance criteria were retained.
  3. Recursive Feature Elimination with Cross-Validation (RFECV): A gradient boosting estimator was used to iteratively remove the least important feature, with 5-fold cross-validated AUC guiding the optimal feature count.

This pipeline selected 20–25 features per disease from an initial candidate pool of approximately 50 NHANES variables.
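Stages 1 and 3 of the pipeline above map directly onto scikit-learn. This sketch uses synthetic data in place of the NHANES candidate pool, with deliberately small sizes, and elides the stage-2 significance filter:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV, mutual_info_classif

# Synthetic stand-in for the NHANES candidate pool.
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=42)

# Stage 1: drop features with near-zero mutual information
# (the 0.01 cutoff is illustrative).
mi = mutual_info_classif(X, y, random_state=42)
X_mi = X[:, mi > 0.01]

# Stage 3: recursive elimination with a gradient boosting estimator,
# guided by 5-fold cross-validated AUC.
selector = RFECV(GradientBoostingClassifier(n_estimators=30, random_state=42),
                 cv=5, scoring="roc_auc").fit(X_mi, y)
```

`selector.n_features_` reports the CV-optimal feature count, mirroring how the real pipeline arrived at 20–25 features per disease.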

3. Preprocessing Pipeline

Categorical Encoding

Categorical variables were one-hot encoded. Race/ethnicity was expanded into five binary indicator columns (with "Mexican-American" as the implicit reference category for v1 models). Income brackets were similarly encoded with "<$20K" as the reference. Binary variables (yes/no) were coded as 1/0.
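In pandas, `drop_first`-style dummy encoding leaves one level as the implicit reference category. A toy example with three of the five NHANES race/ethnicity categories:

```python
import pandas as pd

race = pd.Series(["Mexican-American", "Non-Hispanic White",
                  "Non-Hispanic Black", "Mexican-American"])

# drop_first removes the first (alphabetically sorted) level, so
# "Mexican-American" becomes the implicit reference category here.
encoded = pd.get_dummies(race, drop_first=True)
```

A row of all zeros in `encoded` then denotes the reference category.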

Missing Data Imputation

Missing values were imputed using the population median of each feature from the training data. Median imputation was chosen for robustness to outliers and skewed distributions. The imputer is fitted on training data and applied identically at inference time, ensuring no data leakage.
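With scikit-learn's `SimpleImputer`, the fit/transform split enforces exactly this no-leakage property (toy values):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [100.0, 30.0]])

# Fit on training data only; the learned medians (3.0 and 20.0) are
# reused verbatim at inference time, so no leakage occurs. The median
# of column 0 is unmoved by the 100.0 outlier.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_new = imputer.transform([[np.nan, np.nan]])
```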

Feature Scaling

Logistic regression models (HSV-1) were preceded by standardization (zero mean, unit variance) to ensure stable coefficient estimation. Gradient boosting models (all others) were not scaled, as tree-based algorithms are invariant to monotonic transformations of input features.

Derived Features

Several features were derived from raw NHANES variables during preprocessing:

4. Modeling Approach

Algorithms Evaluated

For each disease target, the following algorithms were compared under 5-fold stratified cross-validation:

  1. Logistic Regression (with and without class-weight balancing)
  2. Gradient Boosting Classifier (standard and tuned hyperparameters)
  3. Calibrated Gradient Boosting (isotonic calibration via CalibratedClassifierCV)
  4. Random Forest (with class-weight balancing, for imbalanced targets)

The algorithm with the highest cross-validated AUC was selected for each disease. Final models were then retrained on the full dataset.
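The selection loop can be sketched as follows, on synthetic data and with only two of the four candidate families shown for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Score every candidate with cross-validated AUC under the same folds.
scores = {name: cross_val_score(est, X, y, cv=cv, scoring="roc_auc").mean()
          for name, est in candidates.items()}

# Pick the winner, then retrain it on the full dataset.
best_name = max(scores, key=scores.get)
best = candidates[best_name].fit(X, y)
```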

Logistic Regression

The logistic regression model estimates the probability of seropositivity as:

P(Y = 1 | x) = 1 / (1 + exp(−(β0 + β1x1 + β2x2 + … + βpxp)))

where x is the vector of input features and β are learned coefficients. The model was fit with L2 regularization (default penalty strength C = 1.0) and a maximum of 1,000 solver iterations. Logistic regression was selected for HSV-1 because it matched gradient boosting performance (AUC 0.750 vs. 0.751) with a simpler, more interpretable model.
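For a binary logistic model, `predict_proba` reproduces this sigmoid formula exactly. A toy check with the configuration described above (the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# L2 penalty with C = 1.0 and up to 1,000 solver iterations.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# Manual evaluation of 1 / (1 + exp(-(b0 + b1*x))) at x = 2.5 ...
z = model.intercept_[0] + model.coef_[0, 0] * 2.5
manual = 1.0 / (1.0 + np.exp(-z))

# ... matches the library's probability output.
sk = model.predict_proba([[2.5]])[0, 1]
```

This transparency of coefficients is what the text means by "more interpretable": each β can be read as a log-odds contribution.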

Gradient Boosting

Gradient boosting builds an additive ensemble of shallow decision trees, where each tree corrects the residual errors of the previous ensemble. At each iteration m, a new tree hm is fit to the negative gradient of the log-loss:

Fm(x) = Fm−1(x) + η · hm(x)

where η is the learning rate. Two hyperparameter configurations were used depending on class balance:

| Parameter | Standard (HSV-2) | Tuned (HCV, HBV, HPV) |
|---|---|---|
| Number of trees | 200 | 300 |
| Max depth | 4 | 3 |
| Learning rate (η) | 0.1 | 0.05 |
| Subsample ratio | 0.8 | 0.7 |
| Min samples per leaf | 1 (default) | max(10, n_pos / 20) |

The tuned configuration uses a lower learning rate, shallower trees, and a minimum leaf size proportional to the positive class count. This prevents the model from creating pure leaf nodes that memorize rare positive cases — critical for HCV, where only 1.45% of samples are positive.
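The two configurations map onto scikit-learn as follows (integer division is assumed for the leaf-size formula; the HCV positive count comes from the imbalance table below):

```python
from sklearn.ensemble import GradientBoostingClassifier

n_pos = 641  # positive HCV cases in the training data

# Standard configuration (HSV-2).
gb_standard = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1, subsample=0.8)

# Tuned configuration (HCV, HBV, HPV): slower learning, shallower trees,
# and a minimum leaf size that scales with the positive class count.
gb_tuned = GradientBoostingClassifier(
    n_estimators=300, max_depth=3, learning_rate=0.05, subsample=0.7,
    min_samples_leaf=max(10, n_pos // 20))
```

For HCV this yields `min_samples_leaf = max(10, 32) = 32`, so no leaf can isolate fewer than 32 respondents.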

Probability Calibration

Raw classifier outputs do not always correspond to true event frequencies. To correct this, calibration was applied using isotonic regression via scikit-learn's CalibratedClassifierCV with 5-fold internal cross-validation. Isotonic regression fits a non-decreasing piecewise-constant function that maps raw scores to observed positive rates:

P_calibrated = f_isotonic(P_raw)    where    f(a) ≤ f(b)  for all  a ≤ b

Isotonic calibration was used for the HSV-2 model (applied during training) and all three HPV models (applied during a post-training recalibration step after the original models were found to overestimate probabilities).
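A sketch of the training-time variant on synthetic imbalanced data (sample size and base estimator settings are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced target (~20% positive).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)

# Wrap the base estimator; the isotonic mapping is learned on
# 5 internal cross-validation folds, as in the HSV-2 model.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    method="isotonic", cv=5).fit(X, y)

proba = calibrated.predict_proba(X)[:, 1]
```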

5. Validation

Cross-Validation Protocol

All models were evaluated using 5-fold stratified cross-validation (random seed = 42, shuffle enabled). Stratification ensures that each fold preserves the class ratio of the full dataset, which is particularly important for imbalanced targets like HCV (1.45% positive).

Evaluation Metrics

Two primary metrics guided model selection and evaluation: ROC AUC, which measures discrimination (the ability to rank seropositive individuals above seronegative ones), and the Brier score, which measures calibration (agreement between predicted probabilities and observed outcomes).

For highly imbalanced targets (HCV, HBV), F1 score and Average Precision (area under the precision-recall curve) were also monitored to ensure adequate sensitivity to the minority class.
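AUC, Brier score, F1, and average precision are all available in scikit-learn; a toy check with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
p_hat = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.9, 0.2])

auc = roc_auc_score(y_true, p_hat)        # discrimination (ranking)
brier = brier_score_loss(y_true, p_hat)   # calibration (probability accuracy)
f1 = f1_score(y_true, (p_hat >= 0.5).astype(int))  # thresholded sensitivity/precision
ap = average_precision_score(y_true, p_hat)        # area under the PR curve
```

Here every positive outscores every negative, so the ranking metrics are perfect while the Brier score still penalizes probabilities that sit away from 0 and 1.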

Model Performance

| Model | Algorithm | Samples | Prevalence | Features | CV AUC | CV Brier |
|---|---|---|---|---|---|---|
| HSV-1 | Logistic Regression | 21,498 | 56.5% | 16 | 0.750 ± 0.007 | 0.200 ± 0.003 |
| HSV-1 (no sexual history) | Logistic Regression | 21,498 | 56.5% | 15 | 0.748 ± 0.007 | 0.201 ± 0.003 |
| HSV-2 | Calibrated Gradient Boosting | 17,622 | 18.7% | 20 | 0.809 ± 0.007 | 0.121 ± 0.001 |
| HCV | Gradient Boosting (tuned) | 44,370 | 1.45% | 25 | 0.901 ± 0.013 | 0.011 ± 0.001 |
| HBV | Gradient Boosting (tuned) | 44,534 | 5.38% | 25 | 0.859 ± 0.006 | 0.043 ± 0.000 |
| HPV-16 | Gradient Boosting (recalibrated) | 7,663 | 13.6% | 20 | 0.743 ± 0.013 | 0.107 ± 0.002 |
| HPV-6 | Gradient Boosting (recalibrated) | 7,663 | 19.5% | 20 | 0.726 ± 0.018 | 0.141 ± 0.003 |
| HPV High-Risk (16/18) | Gradient Boosting (recalibrated) | 7,663 | 16.3% | 20 | 0.748 ± 0.015 | 0.122 ± 0.002 |

All metrics are mean ± standard deviation from 5-fold stratified cross-validation (seed = 42).

6. Variables by Model

HSV Models

| Variable | NHANES Source |
|---|---|
| Age | RIDAGEYR |
| Biological sex | RIAGENDR |
| Race/ethnicity (5 indicators) | RIDRETH1 / RIDRETH3 |
| Household income (4 indicators) | INDHHIN2 |
| College graduate | DMDEDUC2 |
| Age at first sexual contact | SXD031 |
| Lifetime sexual partners | SXD171 / SXD510 |
| Household size | DMDHHSIZ |
| Born in USA | DMDBORN4 |
| Health insurance | HIQ011 |
| Military service | DMQMILIZ |
| 12+ alcoholic drinks (lifetime) | ALQ101 |
| Ever used marijuana | DUQ200 |
| Ever used hard drugs (derived) | DUQ240 / DUQ370 |
| In long-term relationship (derived) | DMDMARTL |

HCV, HBV, and HPV Models

The v2 models use an expanded feature set selected by the automated pipeline described above. Key disease-specific features include:

7. Handling Class Imbalance

Disease prevalence in the training data ranges from 1.45% (HCV) to 56.5% (HSV-1). For rare outcomes, naive models tend to predict low probabilities for all observations and fail to identify true positives. Several strategies were employed to address this:

| Disease | Prevalence | Positive Cases | Imbalance Strategy |
|---|---|---|---|
| HSV-1 | 56.5% | 12,162 | None needed (balanced) |
| HSV-2 | 18.7% | 3,303 | Isotonic calibration |
| HPV types | 13.6–19.5% | 1,043–1,497 | Tuned GB + post-hoc recalibration |
| HBV | 5.38% | 2,398 | Tuned GB + adaptive min leaf size |
| HCV | 1.45% | 641 | Tuned GB + adaptive min leaf size + F1/AP monitoring |

The adaptive minimum leaf size (min_samples_leaf = max(10, n_positive / 20)) stops the gradient boosting model from splitting all the way down to pure leaf nodes, which would overfit to spurious patterns in the minority class.

8. Inference Pipeline

Each trained model is serialized as a scikit-learn Pipeline object using joblib. The pipeline encapsulates all preprocessing stages (imputation, optional scaling) and the classifier into a single artifact, ensuring that the exact transformations applied during training are reproduced at inference time.
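A sketch of that serialization pattern (step names and the `model.joblib` filename are illustrative; the scaling step applies only to the logistic models):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)

# Preprocessing and classifier travel together in one Pipeline.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# One artifact on disk; reloading reproduces training-time transforms exactly.
joblib.dump(pipe, "model.joblib")
restored = joblib.load("model.joblib")
```

Because the imputer and scaler are baked into the artifact, inference code cannot accidentally apply different preprocessing than training did.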

At inference, user inputs are encoded into a numeric feature vector matching the training schema. The probability of seropositivity is obtained as:

p̂ = pipeline.predict_proba(x)[0, 1]

The resulting probability is displayed as a percentage and visualized as a gauge chart. Percentile rank is computed by comparing the user's predicted probability against the distribution of predictions for the full training population.
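The percentile rank can be sketched as follows. The population values are stand-ins, and the strictly-below convention is an assumption about how ties are handled:

```python
import numpy as np

# Predicted probabilities for the full training population (stand-in values).
population_probs = np.array([0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.60, 0.80])

def percentile_rank(p, population):
    """Share of the population with a predicted probability strictly below p."""
    return 100.0 * np.mean(population < p)

rank = percentile_rank(0.30, population_probs)  # 4 of 8 fall below 0.30
```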

9. Limitations

10. Clinical Guidance

These calculators are educational tools. For symptoms, exposure concerns, or diagnosis, consult a licensed clinician and appropriate laboratory testing. Standard confirmatory tests include:

11. Data Sources & References

Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey (NHANES). Available at: www.cdc.gov/nchs/nhanes.

NHANES data files used:

All NHANES data is publicly available, de-identified, and contains no personally identifiable information. No IRB approval was required for this secondary analysis of public-use data.