Add synthetic-EHR generative evaluation metrics #1148
Open
chufangao wants to merge 3 commits into
Adds pyhealth/metrics/generative/, a subpackage for evaluating synthetic EHR data along privacy, utility, and statistical-fidelity axes:
- privacy.py: NNAAR, membership inference attack, discriminator privacy
- utility.py: machine learning efficacy (TRTR vs TSTR), code-prevalence similarity (R2, Pearson, RMSE)
- utils.py: shared data prep, an LSTM classifier, and a random-forest baseline
- evaluate_synthetic_ehr(): convenience orchestrator for the full suite

These functions are ported from a standalone evaluation script. The MIMIC-specific data-loading/CLI glue is dropped; the metrics work on any flat EHR dataframe. Public functions are re-exported from pyhealth.metrics.

Adds unit tests in tests/core/test_generative_metrics.py and Sphinx docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jhnwu3
reviewed
May 20, 2026
def calc_nnaar(
    train_ehr: pd.DataFrame,
Collaborator
Is there a reason why we do this in dataframes? Maybe we can chat about this.
Collaborator
Author
Yeah, we could change it to the nested sequence, since for this task it all reduces to a sequence of sequences. Edit: actually, maybe we should keep it for the utility calculation.
Summary
Adds a complete synthetic-EHR pipeline to PyHealth: a shared generation
task, five generator models, a generative metrics subpackage, an
end-to-end example, and tests for all of the above.
What's new
1. Shared task: pyhealth/tasks/generate_ehr.py
- EHRGeneration: base class that emits one sample per patient, {"patient_id", "visits": [[code, ...], ...]}, processed by NestedSequenceProcessor. Unconditional (no output labels).
- EHRGenerationMIMIC3 / EHRGenerationMIMIC4: dataset-specific subclasses; only the event type and code attribute differ.
- decode_dataset and to_evaluation_dataframe convert real SampleDatasets and any generator's generate() output into the long-form (id, time, visit_codes, labels) dataframe that the metrics consume.
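To make the nested-to-flat conversion concrete, here is a minimal sketch of how a list of {"patient_id", "visits"} samples could be flattened into long-form rows. It is not the PR's to_evaluation_dataframe: the helper name visits_to_long_rows is hypothetical, and the layout (one row per code occurrence, the visit index used as time, labels left empty for the unconditional task) is an assumption about what "long-form" means here.

```python
def visits_to_long_rows(samples):
    """Flatten nested visit sequences into long-form rows.

    Hypothetical helper for illustration only. Assumes one output row per
    (patient, visit, code) and uses the visit position as the time index.
    """
    rows = []
    for sample in samples:
        for time, visit in enumerate(sample["visits"]):
            for code in visit:
                rows.append(
                    {
                        "id": sample["patient_id"],
                        "time": time,
                        "visit_codes": code,
                        "labels": None,  # unconditional task: no output labels
                    }
                )
    return rows


# Two toy patients with made-up codes.
samples = [
    {"patient_id": "p1", "visits": [["250.00", "401.9"], ["428.0"]]},
    {"patient_id": "p2", "visits": [["414.01"]]},
]
rows = visits_to_long_rows(samples)
```

Passing rows to pandas.DataFrame would give a flat frame with the four columns named above. Real datasets and generator output can then share one representation, which is what lets the metrics stay generator-agnostic.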
2. Generator models: pyhealth/models/generators/
All five share the EHRGeneration task and expose the same train_model(train_dataset, val_dataset=None) / generate(num_samples) API.
- halo.py
- gpt2.py
- promptehr.py
- medgan.py
- corgan.py
Each model is registered as a sub-module so .parameters() / .to(device) work, owns its own training loop (best-checkpoint saving included), and returns decoded code strings from generate().

3. Metrics subpackage: pyhealth/metrics/generative/
Evaluation along three axes:
- privacy.py: calc_nnaar (Nearest Neighbor Adversarial Accuracy Risk), calc_membership_inference (membership inference attack), and compute_discriminator_privacy (real-vs-synthetic discriminator score).
- utility.py: compute_mle (machine learning efficacy, TRTR vs TSTR) and compute_prevalence_metrics (code-prevalence similarity: R², Pearson, RMSE).
- utils.py: shared data prep, a self-contained LSTM classifier, and a random-forest baseline.
- evaluate_synthetic_ehr(): convenience orchestrator that runs the full suite and returns one merged {metric: (mean, std)} dict.
Metrics operate on flat dataframes (id, time, visit_codes, labels), so they work for any generator. Public functions are re-exported from pyhealth.metrics.

Port cleanups
- logging instead of bare print calls.
- [...] (.cpu().numpy()).
- Replaced scipy.stats.pearsonr with numpy.corrcoef to avoid an undeclared scipy dependency.

4. End-to-end example: examples/halo_mimic3.py
Loads MIMIC-III → applies EHRGenerationMIMIC3 → trains HALO → generates synthetic patients → runs the full evaluate_synthetic_ehr suite and prints each (mean, std) metric. Verified to run end-to-end on the dev=True subset (NNAAR, MIA, MLE TRTR/TSTR, discriminator privacy, prevalence all produced).
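As an illustration of the code-prevalence similarity listed under the metrics subpackage, the sketch below computes RMSE, Pearson (via numpy.corrcoef, as the port cleanups describe), and R² between real and synthetic prevalence vectors. It is not the PR's compute_prevalence_metrics. The function names are hypothetical, prevalence is assumed to mean the fraction of patients with a code at least once, and R² is assumed to be the coefficient of determination with real prevalence as the target.

```python
import numpy as np


def code_prevalence(patients, vocab):
    """Fraction of patients whose record contains each code at least once."""
    index = {code: i for i, code in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for visits in patients:
        seen = {code for visit in visits for code in visit}
        for code in seen:
            counts[index[code]] += 1
    return counts / max(len(patients), 1)


def prevalence_similarity(real, synthetic):
    """Hypothetical helper: compare per-code prevalence of two patient sets."""
    vocab = sorted({c for p in real + synthetic for v in p for c in v})
    p_real = code_prevalence(real, vocab)
    p_syn = code_prevalence(synthetic, vocab)
    rmse = float(np.sqrt(np.mean((p_real - p_syn) ** 2)))
    # Pearson correlation without scipy; undefined (nan) if either vector is constant.
    pearson = float(np.corrcoef(p_real, p_syn)[0, 1])
    ss_res = float(np.sum((p_real - p_syn) ** 2))
    ss_tot = float(np.sum((p_real - p_real.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": rmse, "pearson": pearson, "r2": r2}


# Each patient is a list of visits; each visit is a list of (made-up) codes.
real = [[["A", "B"], ["C"]], [["A"]], [["A", "C"]]]
disjoint = [[["X"]], [["Y"]], [["X"]]]

same = prevalence_similarity(real, real)
shifted = prevalence_similarity(real, disjoint)
```

With an exact copy the prevalence vectors are identical, so RMSE is 0 and both Pearson and R² are 1, matching the behavior reported for the exact-copy case. A disjoint vocabulary pushes RMSE up.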
5. Tests
All passing.
- tests/core/test_generative_metrics.py
- tests/core/test_halo.py
- tests/core/test_gpt2.py
- tests/core/test_promptehr.py
- tests/core/test_medgan.py
- tests/core/test_corgan.py
The metrics suite includes 5 behavioral tests that verify each metric responds sensibly across three synthetic datasets: an exact copy of the training data, a similar set (~15% of codes perturbed), and a different set (disjoint code vocabulary). Results, in that order:
- 0 → 0.03 → 0.26; exact copy → RMSE 0, R²/Pearson = 1
- 1.0 → 0.1 → 0.0
- 1.0 → 0.94 → 0.46 (chance for unrelated data)
- 1.0 → 0.98 → 0.81

Notes
- [...] bag-of-codes generators (MedGAN, CorGAN): those should be evaluated with metrics="privacy" plus the prevalence metrics. A future revision will let callers plug in a static-label task (mortality, readmission, "ever diagnosed with X") so MLE is meaningful for both families.
- [...] model predicts a constant on identical features, so the score reflects test-split balance rather than 0.5). The behavioral test asserts the robust direction: disjoint synthetic data is cleanly flagged while real-derived data is not.
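The last note, about a model that predicts a constant, can be illustrated with a small sketch. When both classes have identical features, a classifier has nothing to separate them with and may fall back to a single prediction; its accuracy then equals the share of that class in the test split, which is 0.5 only when the split is balanced. The class sizes, split size, and variable names below are assumptions chosen for illustration, not values from the PR.

```python
import random

random.seed(0)

# Binary labels with an assumed 70/30 imbalance, shuffled before splitting.
labels = [1] * 70 + [0] * 30
random.shuffle(labels)

# Hold out the first 40 examples as a test split (assumed size).
test_labels = labels[:40]

# A model that cannot separate the classes predicts one constant label.
constant_prediction = 1
accuracy = sum(1 for y in test_labels if y == constant_prediction) / len(test_labels)

# Share of the predicted class in the test split.
positive_share = sum(test_labels) / len(test_labels)
```

Here accuracy and positive_share are always equal, whatever the shuffle produces. Asserting a direction (disjoint data is flagged, real-derived data is not) is therefore more robust than asserting an exact score of 0.5.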