promptfoo · mldangelo-oai · May 24, 2026
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
@@ -16,4 +16,7 @@ Python CI ignores documentation-only PRs, which are handled by the documentation
 
 The performance workflow compares workload-oriented benchmarks between the PR
 base and head, posts a sticky summary comment on same-repo PRs, uploads JSON and
-Markdown artifacts, and reports regressions without blocking the PR.
+Markdown artifacts, and reports comparative regressions without blocking the
+PR. It separately runs the cache-disabled retained-memory stability guard from
+`tests/test_performance_benchmarks.py`, which fails the workflow if repeat scans
+retain excessive memory.
diff --git a/.github/workflows/perf.yml b/.github/workflows/perf.yml
@@ -9,6 +9,7 @@ on:
       - "tests/helpers/**"
       - "tests/conftest.py"
       - "tests/test_benchmark_report.py"
+      - "tests/test_performance_benchmarks.py"
       - "scripts/benchmark_report.py"
       - "pyproject.toml"
       - "uv.lock"
@@ -23,6 +24,7 @@ on:
       - "tests/helpers/**"
       - "tests/conftest.py"
       - "tests/test_benchmark_report.py"
+      - "tests/test_performance_benchmarks.py"
       - "scripts/benchmark_report.py"
       - "pyproject.toml"
       - "uv.lock"
@@ -153,6 +155,14 @@ jobs:
           cat "$BENCHMARK_ARTIFACT_DIR/benchmark-current.md" >> "$BENCHMARK_ARTIFACT_DIR/benchmark-summary.md"
           cat "$BENCHMARK_ARTIFACT_DIR/benchmark-summary.md" >> "$GITHUB_STEP_SUMMARY"
 
+      - name: Run retained-memory stability guard
+        env:
+          PROMPTFOO_DISABLE_TELEMETRY: "1"
+        run: |
+          uv run --locked --with psutil pytest \
+            tests/test_performance_benchmarks.py::TestPerformanceBenchmarks::test_memory_usage_stability \
+            -q
+
       - name: Comment benchmark summary on PR
         if: >
           always() &&

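Reviewer sketch for the retained-memory guard added above. The workflow installs `psutil`, which suggests the real test measures process-level memory. The version below is not that test: it swaps in the standard-library `tracemalloc` module so it runs without extra dependencies, and `retained_growth_bytes`, `assert_memory_stable`, and the 1 MB limit are hypothetical names and values chosen for illustration. The idea is to warm up, repeat the workload, force a collection, and fail only if memory that survives collection keeps growing.

```python
import gc
import tracemalloc


def retained_growth_bytes(workload, warmup=2, repeats=5):
    """Return bytes still allocated after `repeats` runs of `workload`."""
    for _ in range(warmup):
        workload()  # let one-time allocations (imports, lazy setup) settle first
    gc.collect()
    tracemalloc.start()
    try:
        baseline, _peak = tracemalloc.get_traced_memory()
        for _ in range(repeats):
            workload()
        gc.collect()  # only memory that survives collection counts as retained
        current, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current - baseline


def assert_memory_stable(workload, limit_bytes=1_000_000):
    """Raise AssertionError if repeated runs retain more than `limit_bytes`."""
    growth = retained_growth_bytes(workload)
    if growth > limit_bytes:
        raise AssertionError(f"retained {growth} bytes, limit {limit_bytes}")
    return growth
```

A workload that frees everything it allocates passes. One that appends each result to a module-level list fails once the retained total crosses the limit, which matches the "fails the workflow if repeat scans retain excessive memory" behavior described in the README change.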
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -17,6 +17,57 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - stop flagging a false-positive ONNX Python operator when tensor weight bytes coincidentally spell `PyOp`
 - detect Python operators declared in nested ONNX graphs and functions
 - distinguish ASCII-serialized Torch7 artifacts from plain PyTorch source text
+- route renamed and unknown-field-prefixed CoreML models, including valid unknown groups, reordered fields, and bounded routing candidates, through custom-code and metadata analysis
+- avoid inconclusive protobuf-candidate noise for fully inspected scalar-only text and Keras-owned JSON members while preserving binary-tailed candidates for analysis
+- preserve archive member incomplete-outcome reasons when nested tentative analysis also fails closed
+- route renamed TensorFlow SavedModel and MetaGraph protobufs through unsafe-operation analysis
+- route renamed ONNX protobuf models with prefixed unknown fields through content analysis and fail closed on unresolved or incomplete structure
+- preserve ambiguous budget-exhausted protobuf candidates for tentative analysis without misclassifying non-ONNX payloads
+- route signature-confirmed CNTK and LightGBM artifacts with misleading filenames through security analysis
+- share trusted-content routing across direct, nested, and helper scans so renamed specialized archives retain their format-specific analysis
+- preserve dangerous callable findings when embedded source captures a callable before a later safe overwrite
+- detect high-risk embedded Python callables invoked through explicit `.__call__` wrappers
+- avoid embedded Python execution findings after statically certain safe `setattr` replacement
+- avoid embedded Python execution findings after statically certain namespace-mapping replacement
+- model reflective and saved namespace-map replacements when their receiver remains certain
+- model deterministic mapping-helper replacements when their helper and receiver remain certain
+- avoid embedded Python execution findings after a certain discarded namespace-map removal
+- avoid embedded Python execution findings after certain key-specific namespace deletion
+- avoid embedded Python execution findings after certain namespace-map clearing or certain map mutator calls themselves
+- avoid embedded Python process-execution findings for known non-executing `subprocess` formatting and result/error APIs
+- detect embedded Python `os.exec*`, `os.spawn*`, `os.posix_spawn*`, and `os.startfile` process-launch calls
+- detect JIT-scanned embedded Python `os.posix_spawn*` and `os.startfile` process-launch calls while suppressing certain safe replacements
+- detect JIT-scanned embedded Python `subprocess.check_call`, `getoutput`, and `getstatusoutput` calls while suppressing certain safe replacements
+- detect embedded Python `asyncio.create_subprocess_exec` and `asyncio.create_subprocess_shell` process-launch calls
+- preserve embedded Python execution findings when a replacement receiver is conditional, aliased, or a runtime argument
+- fail closed when nested NeMo checkpoint or referenced-artifact analysis is explicitly incomplete
+- preserve concrete nested security findings from checkpoint and referenced artifacts inside NeMo archives
+- keep PyTorch ZIP path traversal findings attributed to archive safety rules regardless of member names
+- classify executable archive members by their hazard type rather than attacker-controlled name fragments
+- preserve Keras ZIP Python and executable member attribution when filenames contain misleading pickle terms
+- classify incomplete ExecuTorch format scans and embedded Python members without misreporting eval/exec findings
+- classify corrupt magic-confirmed TAR parsing as incomplete analysis rather than a security finding
+- preserve PyTorch ZIP findings when later analysis fails and classify parse failures as incomplete coverage
+- classify Keras ZIP archive-read failures as incomplete coverage while preserving earlier security findings
+- classify Keras H5 read failures as incomplete coverage while preserving earlier security findings
+- classify CoreML parser and traversal coverage gaps as incomplete analysis while preserving concrete findings
+- classify malformed recognized ONNX model parsing as incomplete coverage rather than a security finding
+- preserve MetaGraph security findings from malformed content-routed payloads while reporting incomplete coverage
+- preserve SavedModel security findings from malformed content-routed payloads while reporting incomplete coverage
+- avoid reporting ordinary `sklearn` references in Skops model-card prose as unsafe joblib fallback evidence
+- detect high-risk Python archive-member calls dispatched through static namespace and attribute lookup indirection
+- inspect nested PMML extension attributes for code-shaped execution indicators
+- detect statically obscured high-risk calls in TorchServe handler source
+- avoid classifying SafeTensors documentation examples as executable metadata payloads
+- detect statically obscured builtin execution calls in embedded JIT source analysis
+- share intrinsic builtin namespace execution detection across embedded Python entrypoints
+- route signature-confirmed RKNN artifacts with misleading filenames through security analysis
+- detect embedded Python builtin execution recovered through static global namespace lookups
+- detect embedded Python builtin execution reached through aliased global namespace mappings
+- avoid embedded Python builtin execution findings after statically safe callable overwrites
+- detect embedded Python builtin execution dispatched through aliased namespace accessors
+- avoid embedded Python builtin execution findings after statically safe direct namespace mutations
+- avoid embedded Python builtin execution findings after statically certain aliased namespace mutations
 
 ## [0.2.45](https://github.com/promptfoo/modelaudit/compare/v0.2.44...v0.2.45) (2026-05-03)
 

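Reviewer sketch for the process-launch entries in the changelog above. It is a minimal illustration of static call detection with the standard `ast` module, not ModelAudit's implementation. The patterns are copied from the changelog bullets, while `LAUNCH_PATTERNS`, `dotted_name`, and `find_process_launches` are hypothetical names. It handles direct dotted calls and explicit `.__call__` wrappers only. It does not model aliases, namespace-map lookups, or the safe-overwrite suppression that the changelog also describes.

```python
import ast
import fnmatch

# Patterns taken from the changelog entries; the matching logic is illustrative.
LAUNCH_PATTERNS = (
    "os.exec*",
    "os.spawn*",
    "os.posix_spawn*",
    "os.startfile",
    "asyncio.create_subprocess_exec",
    "asyncio.create_subprocess_shell",
)


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_process_launches(source):
    """Return sorted (line, dotted name) pairs for direct process-launch calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):  # parses only, never executes
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        # Treat `f.__call__(...)` as a call to `f`.
        while name.endswith(".__call__"):
            name = name[: -len(".__call__")]
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in LAUNCH_PATTERNS):
            hits.append((node.lineno, name))
    return sorted(hits)
```

Because the source is only parsed, platform-specific functions such as `os.startfile` can be detected on any host. A name like `os.path.join` does not match any pattern and is ignored.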
diff --git a/README.md b/README.md
@@ -62,7 +62,7 @@ Files scanned: 1 | Issues found: 2 critical, 1 warning
 
 ## Supported Formats
 
-ModelAudit includes 44 registered scanners covering model, archive, and configuration formats:
+ModelAudit includes 45 registered scanners covering model, archive, and configuration formats:
 
 | Format                  | Extensions                                                                | Risk   |
 | ----------------------- | ------------------------------------------------------------------------- | ------ |
@@ -74,7 +74,7 @@ ModelAudit includes 44 registered scanners covering model, archive, and configur
 | **TensorFlow**          | `.pb`, `.meta`, SavedModel dirs                                           | MEDIUM |
 | **Keras**               | `.h5`, `.hdf5`, `.keras`                                                  | MEDIUM |
 | **ONNX**                | `.onnx`                                                                   | MEDIUM |
-| **CoreML**              | `.mlmodel`                                                                | LOW    |
+| **CoreML**              | `.mlmodel`, structurally valid renamed artifacts                          | LOW    |
 | **MXNet**               | `*-symbol.json`, `*-NNNN.params`                                          | LOW    |
 | **NeMo**                | `.nemo`                                                                   | MEDIUM |
 | **CNTK**                | `.dnn`, `.cmf`                                                            | MEDIUM |
@@ -99,6 +99,9 @@ ModelAudit includes 44 registered scanners covering model, archive, and configur
 
 Plus scanners for ZIP, TAR, 7-Zip, OCI layers, Jinja2 templates, JSON/YAML metadata, manifests, model cards, text files, and RAR recognition. RAR archives are reported as unsupported/fail-closed instead of being skipped.
 
+Structurally valid TensorFlow SavedModel and MetaGraph protobufs are also recognized when renamed to non-model suffixes.
+CoreML content routing preserves bounded ambiguous candidates for static custom-code and metadata analysis.
+
 [View complete format documentation](https://www.promptfoo.dev/docs/model-audit/scanners/)
 
 ## Remote Sources

diff --git a/docs/agents/architecture.md b/docs/agents/architecture.md
@@ -17,9 +17,11 @@
 ## Routing & Coverage Invariants
 
 - Prefer trusted file structure and bounded content sniffing over extension-only routing, especially for ZIP-like containers and nested archives.
-- Keep scanner routing metadata descriptor-owned in `scanner_registry_metadata.py`; header-format aliases, content-routed extensions, extension-only format policy, and lazy class exports should come from that descriptor module, with `can_handle()` as the final content gate.
+- Keep scanner routing metadata descriptor-owned in `scanner_registry_metadata.py`; header-format aliases, content-routed extensions, extension-only format policy, and lazy class exports should come from that descriptor module.
+- Keep trusted content routing decisions shared in `scanners/routing.py` so top-level, nested archive, and registry helper flows cannot disagree. Use `can_handle()` as the final gate for suffix-selected candidates; a strict bounded content route may deliberately own a renamed file even when a legacy suffix-only gate declines it.
 - Source discovery filters should consume the registry-backed scannable extension set instead of carrying local allowlists.
 - For routing, prefiltering, or archive-recursion changes, add one malicious positive regression and one benign near-match negative regression.
+- If bounded routing cannot distinguish formats safely, preserve the candidate for tentative analysis; reject disproven or optional-analyzer-unsupported candidates cleanly, and report an inconclusive outcome once an established analysis path cannot complete.
 - If a scanner aborts to avoid partial coverage, make the result operationally explicit (`success=False` with a clear error message) and preserve consistent exit-code and cache behavior.
 
 ## Scanner System

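Reviewer sketch for the routing invariants in the architecture change above. Everything here is hypothetical: `Sniff`, `Scanner`, `Route`, and `route` are stand-ins, not the contents of `scanners/routing.py`, and the precedence order (confirmed content, then suffix plus `can_handle()`, then a tentative candidate) is one assumed reading of the bullets. It shows a single shared function that every caller uses, so top-level, nested, and helper flows reach the same decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Sniff(Enum):
    CONFIRMED = "confirmed"  # strict bounded content check passed
    AMBIGUOUS = "ambiguous"  # budget ran out before a verdict
    DISPROVEN = "disproven"  # content cannot be this format


@dataclass(frozen=True)
class Scanner:  # hypothetical stand-in, not ModelAudit's scanner class
    name: str
    suffixes: frozenset
    sniff: Callable[[bytes], Sniff]
    can_handle: Callable[[str, bytes], bool]


@dataclass(frozen=True)
class Route:
    scanner: Optional[str]
    tentative: bool = False


def route(path, head, scanners):
    """Pick a scanner from a bounded prefix `head` of the file content."""
    verdicts = [(s, s.sniff(head)) for s in scanners]
    for s, verdict in verdicts:
        if verdict is Sniff.CONFIRMED:
            return Route(s.name)  # content route owns the file, whatever its suffix
    for s in scanners:
        # can_handle() is the final gate for suffix-selected candidates only.
        if path.endswith(tuple(s.suffixes)) and s.can_handle(path, head):
            return Route(s.name)
    for s, verdict in verdicts:
        if verdict is Sniff.AMBIGUOUS:
            return Route(s.name, tentative=True)  # preserve for tentative analysis
    return Route(None)  # every candidate disproven: reject cleanly
```

In this sketch a renamed file with a confirmed structure is routed by content even though no suffix matches, and an ambiguous candidate is kept with `tentative=True` instead of being dropped.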
diff --git a/docs/agents/performance-audit.md b/docs/agents/performance-audit.md
@@ -32,6 +32,7 @@ The PR benchmark lane lives in:
 
 - `tests/benchmarks/test_scan_benchmarks.py`
 - `tests/benchmarks/test_picklescan_benchmarks.py`
+- `tests/test_performance_benchmarks.py` (`test_memory_usage_stability` cache-disabled guard only)
 - `.github/workflows/perf.yml`
 - `scripts/benchmark_report.py`
 
@@ -65,7 +66,10 @@ user-relevant workload or guards a security-critical hot path.
 
 The GitHub Actions performance workflow runs the benchmark suite on the PR base
 and head, posts a sticky summary comment, and uploads JSON plus Markdown
-artifacts. It is advisory: it reports regressions without blocking the PR.
+artifacts. It also runs the cache-disabled retained-memory stability guard from
+`tests/test_performance_benchmarks.py`; older timing-sensitive tests in that
+module remain outside the PR lane. The comparative benchmark report is
+advisory, while a failed retained-memory guard fails the workflow.
 
 ### Local Benchmark Run
 

diff --git a/docs/user/compatibility-matrix.md b/docs/user/compatibility-matrix.md
@@ -19,7 +19,7 @@ This page shows which model formats work in base install and which require optio
 | TensorFlow SavedModel/MetaGraph | `.pb`, `.meta`, SavedModel directories                            | Yes (vendored protos)                                     | `modelaudit[tensorflow]` on Python 3.11-3.12 for TensorFlow-dependent checkpoint/weight analysis |
 | Keras H5                        | `.h5`, `.hdf5`                                                    | No                                                        | `modelaudit[h5]` (required)                                                                      |
 | ONNX                            | `.onnx`                                                           | No                                                        | `modelaudit[onnx]` on Python 3.10-3.12 (required)                                                |
-| CoreML                          | `.mlmodel`                                                        | Yes (static protobuf/metadata checks)                     | None                                                                                             |
+| CoreML                          | `.mlmodel`, validated or bounded-candidate renamed artifacts      | Yes (static protobuf/metadata checks)                     | None                                                                                             |
 | NeMo                            | `.nemo`                                                           | Yes (static tar/config analysis, Hydra `_target_` checks) | None                                                                                             |
 | CNTK native                     | `.dnn`, `.cmf`                                                    | Yes (static signature and string analysis)                | None                                                                                             |
 | RKNN models                     | `.rknn`                                                           | Yes (static bounded metadata checks)                      | None                                                                                             |
@@ -45,6 +45,8 @@ This page shows which model formats work in base install and which require optio
 ## Notes
 
 - Scanner selection is extension- and content-aware; overlapping extensions may be dispatched to different scanners based on file content.
+- TensorFlow SavedModel/MetaGraph content routing recognizes renamed protobufs only after strict structural validation; oversized plausible candidates are retained for fail-closed bounded analysis.
+- CoreML content routing tentatively analyzes bounded protobuf candidates so unknown valid fields cannot hide custom-code or metadata findings.
 - Runtime scanner selection is available with `modelaudit scan --scanners ...` and `--exclude-scanner ...`; use `modelaudit scan --list-scanners` to discover scanner IDs.
 - Compressed wrappers enforce limits via `compressed_max_decompressed_bytes`, `compressed_max_decompression_ratio`, and `compressed_max_depth`.
 - R serialized (`.rds/.rda/.rdata`) support is static-only: ModelAudit does not execute R code or evaluate objects in an R runtime.
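
Reviewer sketch for the "strict structural validation" and "bounded candidates" wording in the notes above. It walks top-level protobuf wire-format fields inside a byte budget and returns one of three outcomes. The function names, the 4096-byte default budget, and the outcome strings are assumptions for illustration, not ModelAudit's routing code. It also rejects group wire types for simplicity, whereas the changelog says the real routing accepts valid unknown groups.

```python
def read_varint(data, pos, limit):
    """Decode a base-128 varint starting at `pos`.

    Returns (value, next_pos), or None if the varint is cut off at `limit`
    or runs longer than 10 bytes.
    """
    value = shift = 0
    while pos < limit and shift < 64:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7
    return None


def sniff_protobuf(data, budget=4096):
    """Classify top-level framing as 'valid', 'invalid', or 'inconclusive'."""
    limit = min(len(data), budget)
    truncated = len(data) > budget  # more bytes exist than we agreed to inspect
    unfinished = "inconclusive" if truncated else "invalid"
    pos = fields = 0
    while pos < limit:
        tag = read_varint(data, pos, limit)
        if tag is None:
            return unfinished  # conservative: keep the candidate if we ran out of budget
        key, pos = tag
        field_number, wire_type = key >> 3, key & 7
        if field_number == 0:
            return "invalid"  # field number 0 is never legal
        if wire_type == 0:  # varint
            item = read_varint(data, pos, limit)
            if item is None:
                return unfinished
            pos = item[1]
        elif wire_type == 1:  # fixed 64-bit
            pos += 8
        elif wire_type == 5:  # fixed 32-bit
            pos += 4
        elif wire_type == 2:  # length-delimited
            length = read_varint(data, pos, limit)
            if length is None:
                return unfinished
            pos = length[1] + length[0]
        else:
            return "invalid"  # groups (3, 4) and reserved types rejected in this sketch
        if pos > limit:
            return unfinished  # field body runs past the inspected window
        fields += 1
    if fields == 0:
        return "invalid"
    return "inconclusive" if truncated else "valid"
```

An "inconclusive" result corresponds to the retained candidate in the notes: the inspected prefix is consistent with protobuf framing, but the budget ended before the whole payload could be checked, so a caller would hand the file to bounded analysis and fail closed if that analysis cannot finish.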