Methodology

This page describes how the HerpesChance risk calculators were built, validated, and deployed. All calculators estimate the statistical likelihood of seropositivity and are not diagnostic tests.

1. Data Foundation

All models were trained on publicly available data from the CDC National Health and Nutrition Examination Survey (NHANES). NHANES is a nationally representative cross-sectional survey that combines interviews, physical examinations, and laboratory tests on approximately 5,000 US residents per two-year cycle.

Survey Cycles Used

| Disease | NHANES Cycles | Years | Participants | Serology Target |
|---|---|---|---|---|
| HSV-1 | 6 cycles (D–I) | 2005–2016 | 21,498 | LBXHE1 (IgG antibody) |
| HSV-2 | 6 cycles (D–I) | 2005–2016 | 17,622 | LBXHE2 (IgG antibody) |
| HCV | 6 cycles (D–I) | 2005–2016 | 44,370 | LBDHCV / LBXHCR (antibody + RNA) |
| HBV | 6 cycles (D–I) | 2005–2016 | 44,534 | LBXHBC (core antibody) |
| HPV (16, 6, HR) | 2 cycles (E–F) | 2007–2010 | 7,663 | LBX06 / LBX16 / LBX18 (Luminex) |

NHANES data files were merged on the unique respondent identifier (SEQN). Demographics were inner-joined with laboratory serology results; questionnaire modules (sexual behavior, drug use, alcohol use, health insurance) were left-joined to preserve sample size when questionnaire data was missing.
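The merge strategy above can be sketched with pandas. The frames and values here are toy stand-ins for the NHANES modules (the real pipeline reads the published data files); only the join logic is the point:

```python
import pandas as pd

# Toy frames standing in for NHANES modules (SEQN is the respondent key).
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [34, 51, 27, 45]})
labs = pd.DataFrame({"SEQN": [1, 2, 4], "LBXHE2": [1, 0, 1]})
sxq = pd.DataFrame({"SEQN": [1, 4], "SXD171": [5, 12]})

# Inner-join demographics with serology: keep only respondents with lab results.
df = demo.merge(labs, on="SEQN", how="inner")

# Left-join the questionnaire module: the sample is preserved even where
# questionnaire answers are missing (those cells become NaN).
df = df.merge(sxq, on="SEQN", how="left")
```

Respondent 3 drops out at the inner join (no serology), while respondent 2 survives the left join with a missing questionnaire value.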

Variable Harmonization

Several NHANES variable codes changed across cycles. These were harmonized during processing:

2. Feature Selection

HSV Models (v1)

Variables were selected through a two-stage process. First, univariate t-tests (for continuous variables) and chi-squared tests (for categorical variables) were run against HSV serostatus to identify NHANES variables that were significantly associated with the outcome and had low rates of missing responses. Variables passing the significance threshold were then entered into a multivariate logistic regression, and those retaining significance in the joint model were included in the final feature set.
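A minimal sketch of the two-stage screen on synthetic stand-in data (the variable names, effect sizes, and 0.05 threshold are illustrative, not the production configuration):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)                      # synthetic serostatus
cols = {
    "age": 30 + 5 * y + rng.normal(0, 8, n),   # shifts with serostatus
    "noise": rng.normal(0, 1, n),              # unrelated filler variable
}

# Stage 1: univariate t-test per continuous variable.
keep = [name for name, x in cols.items()
        if stats.ttest_ind(x[y == 1], x[y == 0]).pvalue < 0.05]

# Stage 2: joint logistic regression over the stage-1 survivors;
# coefficients retaining significance would enter the final feature set.
X = np.column_stack([cols[k] for k in keep])
model = LogisticRegression(max_iter=1000).fit(X, y)
```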

HCV, HBV, and HPV Models (v2)

An automated three-stage feature selection pipeline was used:

  1. Mutual Information screening: All candidate features were scored against the target using mutual information. Features with near-zero MI were eliminated.
  2. Statistical significance filter: Remaining features were tested for univariate association with the target. Only those meeting significance criteria were retained.
  3. Recursive Feature Elimination with Cross-Validation (RFECV): A gradient boosting estimator was used to iteratively remove the least important feature, with 5-fold cross-validated AUC guiding the optimal feature count.

This pipeline selected 20–25 features per disease from an initial candidate pool of approximately 50 NHANES variables.
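Stages 1 and 3 of the pipeline above map directly onto scikit-learn. This sketch uses synthetic data in place of the NHANES candidate pool, with deliberately small sizes, and elides the stage-2 significance filter:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV, mutual_info_classif

# Synthetic stand-in for the NHANES candidate pool.
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=42)

# Stage 1: drop features with near-zero mutual information
# (the 0.01 cutoff is illustrative).
mi = mutual_info_classif(X, y, random_state=42)
X_mi = X[:, mi > 0.01]

# Stage 3: recursive elimination with a gradient boosting estimator,
# guided by 5-fold cross-validated AUC.
selector = RFECV(GradientBoostingClassifier(n_estimators=30, random_state=42),
                 cv=5, scoring="roc_auc").fit(X_mi, y)
```

`selector.n_features_` reports the CV-optimal feature count, mirroring how the real pipeline arrived at 20–25 features per disease.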

3. Preprocessing Pipeline

Categorical Encoding

Categorical variables were one-hot encoded. Race/ethnicity was expanded into five binary indicator columns (with "Mexican-American" as the implicit reference category for v1 models). Income brackets were similarly encoded with "<$20K" as the reference. Binary variables (yes/no) were coded as 1/0.
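In pandas, `drop_first`-style dummy encoding leaves one level as the implicit reference category. A toy example with three of the five NHANES race/ethnicity categories:

```python
import pandas as pd

race = pd.Series(["Mexican-American", "Non-Hispanic White",
                  "Non-Hispanic Black", "Mexican-American"])

# drop_first removes the first (alphabetically sorted) level, so
# "Mexican-American" becomes the implicit reference category here.
encoded = pd.get_dummies(race, drop_first=True)
```

A row of all zeros in `encoded` then denotes the reference category.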

Missing Data Imputation

Missing values were imputed using the population median of each feature from the training data. Median imputation was chosen for robustness to outliers and skewed distributions. The imputer is fitted on training data and applied identically at inference time, ensuring no data leakage.
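With scikit-learn's `SimpleImputer`, the fit/transform split enforces exactly this no-leakage property (toy values):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [100.0, 30.0]])

# Fit on training data only; the learned medians (3.0 and 20.0) are
# reused verbatim at inference time, so no leakage occurs. The median
# of column 0 is unmoved by the 100.0 outlier.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_new = imputer.transform([[np.nan, np.nan]])
```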

Feature Scaling

Logistic regression models (HSV-1) were preceded by standardization (zero mean, unit variance) to ensure stable coefficient estimation. Gradient boosting models (all others) were not scaled, as tree-based algorithms are invariant to monotonic transformations of input features.

Derived Features

Several features were derived from raw NHANES variables during preprocessing:

4. Modeling Approach

Algorithms Evaluated

For each disease target, the following algorithms were compared under 5-fold stratified cross-validation:

  1. Logistic Regression (with and without class-weight balancing)
  2. Gradient Boosting Classifier (standard and tuned hyperparameters)
  3. Calibrated Gradient Boosting (isotonic calibration via CalibratedClassifierCV)
  4. Random Forest (with class-weight balancing, for imbalanced targets)

The algorithm with the highest cross-validated AUC was selected for each disease. Final models were then retrained on the full dataset.
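The selection loop can be sketched as follows, on synthetic data and with only two of the four candidate families shown for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Score every candidate with cross-validated AUC under the same folds.
scores = {name: cross_val_score(est, X, y, cv=cv, scoring="roc_auc").mean()
          for name, est in candidates.items()}

# Pick the winner, then retrain it on the full dataset.
best_name = max(scores, key=scores.get)
best = candidates[best_name].fit(X, y)
```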

Logistic Regression

The logistic regression model estimates the probability of seropositivity as:

P(Y = 1 | x) = 1 / (1 + exp(−(β0 + β1x1 + β2x2 + … + βpxp)))

where x is the vector of input features and β are learned coefficients. The model was fit with L2 regularization (default penalty strength C = 1.0) and a maximum of 1,000 solver iterations. Logistic regression was selected for HSV-1 because it matched gradient boosting performance (AUC 0.750 vs. 0.751) with a simpler, more interpretable model.
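For a binary logistic model, `predict_proba` reproduces this sigmoid formula exactly. A toy check with the configuration described above (the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# L2 penalty with C = 1.0 and up to 1,000 solver iterations.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)

# Manual evaluation of 1 / (1 + exp(-(b0 + b1*x))) at x = 2.5 ...
z = model.intercept_[0] + model.coef_[0, 0] * 2.5
manual = 1.0 / (1.0 + np.exp(-z))

# ... matches the library's probability output.
sk = model.predict_proba([[2.5]])[0, 1]
```

This transparency of coefficients is what the text means by "more interpretable": each β can be read as a log-odds contribution.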

Gradient Boosting

Gradient boosting builds an additive ensemble of shallow decision trees, where each tree corrects the residual errors of the previous ensemble. At each iteration m, a new tree hm is fit to the negative gradient of the log-loss:

Fm(x) = Fm−1(x) + η · hm(x)

where η is the learning rate. Two hyperparameter configurations were used depending on class balance:

| Parameter | Standard (HSV-2) | Tuned (HCV, HBV, HPV) |
|---|---|---|
| Number of trees | 200 | 300 |
| Max depth | 4 | 3 |
| Learning rate (η) | 0.1 | 0.05 |
| Subsample ratio | 0.8 | 0.7 |
| Min samples per leaf | 1 (default) | max(10, n_pos / 20) |

The tuned configuration uses a lower learning rate, shallower trees, and a minimum leaf size proportional to the positive class count. This prevents the model from creating pure leaf nodes that memorize rare positive cases — critical for HCV, where only 1.45% of samples are positive.
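The two configurations map onto scikit-learn as follows (integer division is assumed for the leaf-size formula; the HCV positive count comes from the imbalance table below):

```python
from sklearn.ensemble import GradientBoostingClassifier

n_pos = 641  # positive HCV cases in the training data

# Standard configuration (HSV-2).
gb_standard = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1, subsample=0.8)

# Tuned configuration (HCV, HBV, HPV): slower learning, shallower trees,
# and a minimum leaf size that scales with the positive class count.
gb_tuned = GradientBoostingClassifier(
    n_estimators=300, max_depth=3, learning_rate=0.05, subsample=0.7,
    min_samples_leaf=max(10, n_pos // 20))
```

For HCV this yields `min_samples_leaf = max(10, 32) = 32`, so no leaf can isolate fewer than 32 respondents.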

Probability Calibration

Raw classifier outputs do not always correspond to true event frequencies. To correct this, calibration was applied using isotonic regression via scikit-learn's CalibratedClassifierCV with 5-fold internal cross-validation. Isotonic regression fits a non-decreasing piecewise-constant function that maps raw scores to observed positive rates:

P_calibrated = f_isotonic(P_raw)    where    f(a) ≤ f(b)  for all  a ≤ b

Isotonic calibration was used for the HSV-2 model (applied during training) and all three HPV models (applied during a post-training recalibration step after the original models were found to overestimate probabilities).
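A sketch of the training-time variant on synthetic imbalanced data (sample size and base estimator settings are illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced target (~20% positive).
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)

# Wrap the base estimator; the isotonic mapping is learned on
# 5 internal cross-validation folds, as in the HSV-2 model.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    method="isotonic", cv=5).fit(X, y)

proba = calibrated.predict_proba(X)[:, 1]
```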

5. Validation

Cross-Validation Protocol

All models were evaluated using 5-fold stratified cross-validation (random seed = 42, shuffle enabled). Stratification ensures that each fold preserves the class ratio of the full dataset, which is particularly important for imbalanced targets like HCV (1.45% positive).

Evaluation Metrics

Two primary metrics guided model selection and evaluation: ROC AUC, which measures discrimination (the ability to rank seropositive individuals above seronegative ones), and the Brier score, which measures calibration (agreement between predicted probabilities and observed outcomes).

For highly imbalanced targets (HCV, HBV), F1 score and Average Precision (area under the precision-recall curve) were also monitored to ensure adequate sensitivity to the minority class.
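AUC, Brier score, F1, and average precision are all available in scikit-learn; a toy check with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             f1_score, roc_auc_score)

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
p_hat = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.9, 0.2])

auc = roc_auc_score(y_true, p_hat)        # discrimination (ranking)
brier = brier_score_loss(y_true, p_hat)   # calibration (probability accuracy)
f1 = f1_score(y_true, (p_hat >= 0.5).astype(int))  # thresholded sensitivity/precision
ap = average_precision_score(y_true, p_hat)        # area under the PR curve
```

Here every positive outscores every negative, so the ranking metrics are perfect while the Brier score still penalizes probabilities that sit away from 0 and 1.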

Model Performance

| Model | Algorithm | Samples | Prevalence | Features | CV AUC | CV Brier |
|---|---|---|---|---|---|---|
| HSV-1 | Logistic Regression | 21,498 | 56.5% | 16 | 0.750 ± 0.007 | 0.200 ± 0.003 |
| HSV-1 (no sexual history) | Logistic Regression | 21,498 | 56.5% | 15 | 0.748 ± 0.007 | 0.201 ± 0.003 |
| HSV-2 | Calibrated Gradient Boosting | 17,622 | 18.7% | 20 | 0.809 ± 0.007 | 0.121 ± 0.001 |
| HCV | Gradient Boosting (tuned) | 44,370 | 1.45% | 25 | 0.901 ± 0.013 | 0.011 ± 0.001 |
| HBV | Gradient Boosting (tuned) | 44,534 | 5.38% | 25 | 0.859 ± 0.006 | 0.043 ± 0.000 |
| HPV-16 | Gradient Boosting (recalibrated) | 7,663 | 13.6% | 20 | 0.743 ± 0.013 | 0.107 ± 0.002 |
| HPV-6 | Gradient Boosting (recalibrated) | 7,663 | 19.5% | 20 | 0.726 ± 0.018 | 0.141 ± 0.003 |
| HPV High-Risk (16/18) | Gradient Boosting (recalibrated) | 7,663 | 16.3% | 20 | 0.748 ± 0.015 | 0.122 ± 0.002 |

All metrics are mean ± standard deviation from 5-fold stratified cross-validation (seed = 42).

6. Variables by Model

HSV Models

| Variable | NHANES Source |
|---|---|
| Age | RIDAGEYR |
| Biological sex | RIAGENDR |
| Race/ethnicity (5 indicators) | RIDRETH1 / RIDRETH3 |
| Household income (4 indicators) | INDHHIN2 |
| College graduate | DMDEDUC2 |
| Age at first sexual contact | SXD031 |
| Lifetime sexual partners | SXD171 / SXD510 |
| Household size | DMDHHSIZ |
| Born in USA | DMDBORN4 |
| Health insurance | HIQ011 |
| Military service | DMQMILIZ |
| 12+ alcoholic drinks (lifetime) | ALQ101 |
| Ever used marijuana | DUQ200 |
| Ever used hard drugs (derived) | DUQ240 / DUQ370 |
| In long-term relationship (derived) | DMDMARTL |

HCV, HBV, and HPV Models

The v2 models use an expanded feature set selected by the automated pipeline described above. Key disease-specific features include:

7. Handling Class Imbalance

Disease prevalence in the training data ranges from 1.45% (HCV) to 56.5% (HSV-1). For rare outcomes, naive models tend to predict low probabilities for all observations and fail to identify true positives. Several strategies were employed to address this:

| Disease | Prevalence | Positive Cases | Imbalance Strategy |
|---|---|---|---|
| HSV-1 | 56.5% | 12,162 | None needed (balanced) |
| HSV-2 | 18.7% | 3,303 | Isotonic calibration |
| HPV types | 13.6–19.5% | 1,043–1,497 | Tuned GB + post-hoc recalibration |
| HBV | 5.38% | 2,398 | Tuned GB + adaptive min leaf size |
| HCV | 1.45% | 641 | Tuned GB + adaptive min leaf size + F1/AP monitoring |

The adaptive minimum leaf size (min_samples_leaf = max(10, n_positive / 20)) stops the gradient boosting model from splitting all the way down to pure leaf nodes, which would overfit to spurious patterns in the minority class.

8. Inference Pipeline

Each trained model is serialized as a scikit-learn Pipeline object using joblib. The pipeline encapsulates all preprocessing stages (imputation, optional scaling) and the classifier into a single artifact, ensuring that the exact transformations applied during training are reproduced at inference time.
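A sketch of that serialization pattern (step names and the `model.joblib` filename are illustrative; the scaling step applies only to the logistic models):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)

# Preprocessing and classifier travel together in one Pipeline.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# One artifact on disk; reloading reproduces training-time transforms exactly.
joblib.dump(pipe, "model.joblib")
restored = joblib.load("model.joblib")
```

Because the imputer and scaler are baked into the artifact, inference code cannot accidentally apply different preprocessing than training did.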

At inference, user inputs are encoded into a numeric feature vector matching the training schema. The probability of seropositivity is obtained as:

p̂ = pipeline.predict_proba(x)[0, 1]

The resulting probability is displayed as a percentage and visualized as a gauge chart. Percentile rank is computed by comparing the user's predicted probability against the distribution of predictions for the full training population.
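The percentile rank can be sketched as follows. The population values are stand-ins, and the strictly-below convention is an assumption about how ties are handled:

```python
import numpy as np

# Predicted probabilities for the full training population (stand-in values).
population_probs = np.array([0.02, 0.05, 0.10, 0.15, 0.30, 0.45, 0.60, 0.80])

def percentile_rank(p, population):
    """Share of the population with a predicted probability strictly below p."""
    return 100.0 * np.mean(population < p)

rank = percentile_rank(0.30, population_probs)  # 4 of 8 fall below 0.30
```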

9. Limitations

10. Clinical Guidance

These calculators are educational tools. For symptoms, exposure concerns, or diagnosis, consult a licensed clinician and appropriate laboratory testing. Standard confirmatory tests include:

11. Data Sources & References

Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey (NHANES). Available at: www.cdc.gov/nchs/nhanes.

NHANES data files used:

All NHANES data is publicly available, de-identified, and contains no personally identifiable information. No IRB approval was required for this secondary analysis of public-use data.