perf(inference): eliminate ORT thread-pool spin-wait + synthetic test source #120

Merged: TCVinNYC merged 2 commits into main from perf/cpu-multicam on Jun 26, 2026
@TCVinNYC

Problem

With multiple cameras the app pegged CPU (~930% / 66% system on 4 real cameras) with bursty variance and frame drops. Profiling the live session (macOS `sample`) showed the #1 consumer was ORT thread-pool spin-wait (`ThreadPoolTempl::WorkerLoop`, 10,105 leaf samples; 88 ORT worker threads) — threads busy-spinning between intermittent inference runs, not actual compute. insightface's SCRFD+ArcFace sessions made it worse: `get_model()` forwards only `providers`, so they ran cores-wide + spinning, bypassing the per-camera thread cap.

Fix

  • `inference.py` — `_apply_low_idle_threading()` disables ORT intra/inter-op spinning (`session.intra_op.allow_spinning=0`) and forces a single sequential inter-op pool on every `make_session()` session (detector + pose). Pure scheduling change; identical inference results.
  • `identify.py` — `_capped_insightface_sessions()` scopes a patch of `onnxruntime.InferenceSession.__init__` to inject a capped, non-spinning `SessionOptions` during `FaceAnalysis` construction (the only hook insightface leaves). Guarded by a module lock since the degraded per-worker face-build path isn't otherwise serialised across camera threads.
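
For reviewers unfamiliar with the ORT knobs involved, here is a minimal sketch of what a helper like `_apply_low_idle_threading()` might do. It is not the code from this PR: the function names, the `intra_threads` default, and the split into two helpers are assumptions made for illustration. Only the `SessionOptions` attributes and the `session.*.allow_spinning` config keys come from ONNX Runtime itself.

```python
from typing import Any


def apply_low_idle_threading(opts: Any, intra_threads: int = 2) -> Any:
    """Configure a SessionOptions-like object so idle workers park instead of spinning.

    `intra_threads` is a hypothetical per-camera cap, not a value taken from the PR.
    """
    opts.intra_op_num_threads = intra_threads
    opts.inter_op_num_threads = 1  # single inter-op pool
    # ORT session config entries take string values; "0" turns spinning off.
    opts.add_session_config_entry("session.intra_op.allow_spinning", "0")
    opts.add_session_config_entry("session.inter_op.allow_spinning", "0")
    return opts


def make_low_idle_options(intra_threads: int = 2) -> Any:
    """Build real ORT options; onnxruntime is imported lazily so the helper
    above can be exercised without it installed."""
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL  # run graph nodes sequentially
    return apply_low_idle_threading(opts, intra_threads)
```

A session would then be created with something like `ort.InferenceSession(model_path, sess_options=make_low_idle_options(), providers=[...])`. None of these settings touch the graph or the kernels, which is consistent with the "pure scheduling change" claim above.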

Results (validated)

| Scenario | old | fixed |
|---|---|---|
| Inference pool, headless A/B (4 cams, yolo11m, all services) | 514% | 273% |
| Real 4 cameras, full pipeline (PyCharm) | ~930% / 66% sys | ~411% / 29% sys |

A 2.3× reduction on the real 4-camera run (~930% to ~411% CPU), with system time at 29%, under the 30% system target, and the CPU variance gone. Live re-profile confirms `WorkerLoop` is eliminated from the hotspots.

Test / scaling infrastructure (enables headless multi-camera CPU validation, no camera permission)

  • `SyntheticAdapter` (`source type: synthetic`) — procedural / image / video synthetic source that pans the scene so detection/tracking/ego all run; effective-fps telemetry (`AUTOPTZ_SYNTH_DEBUG`).
  • `AUTOPTZ_DB_PATH` — run against an isolated config profile.
  • `AUTOPTZ_SKIP_CAMERA_PREFLIGHT` — start the engine with no local camera (NDI/RTSP/synthetic-only or headless), instead of gating all engine start on macOS camera permission.
  • `tests/test_synthetic_source.py` (9 tests).
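
As a rough illustration of how the two environment overrides above could be read, the sketch below resolves a database path and a preflight-skip flag from the environment. The function names, the default path handling, and the set of values treated as "true" are all assumptions; the PR text only names the variables, not how they are parsed.

```python
import os
from pathlib import Path
from typing import Mapping

# Assumption: which strings count as "enabled" is not stated in the PR.
_TRUTHY = {"1", "true", "yes", "on"}


def resolve_db_path(default: Path, env: Mapping[str, str] = os.environ) -> Path:
    """Return the AUTOPTZ_DB_PATH override if set and non-empty, else the default."""
    override = env.get("AUTOPTZ_DB_PATH", "").strip()
    return Path(override).expanduser() if override else default


def skip_camera_preflight(env: Mapping[str, str] = os.environ) -> bool:
    """True when AUTOPTZ_SKIP_CAMERA_PREFLIGHT holds a truthy value."""
    return env.get("AUTOPTZ_SKIP_CAMERA_PREFLIGHT", "").strip().lower() in _TRUTHY
```

Passing `env` explicitly keeps both helpers easy to unit-test without mutating the real process environment.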

All 1336 tests pass; ruff + mypy + selftest green.

🤖 Generated with Claude Code

TCVinNYC and others added 2 commits June 25, 2026 23:41
…ace pools

The dominant CPU consumer with multiple cameras was ORT thread-pool spin-wait,
not inference compute. A profile (macOS sample) of a live 4-camera session
showed ThreadPoolTempl::WorkerLoop as the #1 leaf (10105 samples) with 88 ORT
worker threads — each session's intra-op pool busy-spins ~200ms before parking,
so several intermittently-run sessions (detect every Nth frame, pose/face a few
Hz) turn into a wall of idle CPU.

Fixes:
- inference.py: _apply_low_idle_threading() disables intra/inter-op spinning and
  forces a single sequential inter-op pool on every make_session() session
  (detector + pose). Pure scheduling change; no effect on results.
- identify.py: insightface's get_model() forwards only providers (no SessionOptions),
  so its SCRFD+ArcFace sessions ran cores-wide + spinning, bypassing the cap.
  _capped_insightface_sessions() scopes a patch of InferenceSession.__init__ to
  inject a capped, non-spinning SessionOptions during FaceAnalysis construction,
  guarded by a module lock (the degraded per-worker face-build path is not
  otherwise serialised across camera threads).
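
The scoped-patch idea in the identify.py fix can be sketched as a lock-guarded context manager. This is an illustration under assumptions, not the PR's `_capped_insightface_sessions()`: the names `capped_sessions`, `session_cls` and `make_options` are hypothetical, and the only thing relied on is that the session constructor accepts `sess_options` as its second positional parameter or as a keyword, as `onnxruntime.InferenceSession` does.

```python
import contextlib
import threading
from typing import Any, Callable, Iterator

# Module-level lock: the patch mutates a class shared by every thread.
_PATCH_LOCK = threading.Lock()


@contextlib.contextmanager
def capped_sessions(session_cls: type, make_options: Callable[[], Any]) -> Iterator[None]:
    """Temporarily wrap session_cls.__init__ to inject default sess_options."""
    with _PATCH_LOCK:
        original_init = session_cls.__init__

        def patched_init(self: Any, *args: Any, **kwargs: Any) -> None:
            # args[0] is the model path; args[1], if present, is sess_options.
            # Inject only when the caller supplied no options of their own.
            if len(args) < 2 and kwargs.get("sess_options") is None:
                kwargs["sess_options"] = make_options()
            original_init(self, *args, **kwargs)

        session_cls.__init__ = patched_init
        try:
            yield
        finally:
            # Always restore, even if model construction raises.
            session_cls.__init__ = original_init
```

In use, the third-party constructor would be wrapped as `with capped_sessions(onnxruntime.InferenceSession, make_options): app = FaceAnalysis(...)`. Holding the lock for the whole block serialises face-model construction across camera threads; the patch itself is still visible to any other thread that builds a session without going through this context manager while it is active.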

Headless A/B (4 cams, yolo11m, all services): inference pool 514% -> 273% CPU.
Live-GUI profile confirms WorkerLoop eliminated from the hotspots.

Test/scaling infra (enables headless multi-camera CPU validation, no camera perm):
- ingest.py: SyntheticAdapter — procedural / image / video synthetic source that
  pans the scene so detection/tracking/ego all run; effective-fps telemetry.
- frame_source.py + models.py: wire source type "synthetic".
- store.py: AUTOPTZ_DB_PATH override to run against an isolated profile.
- app.py: AUTOPTZ_SKIP_CAMERA_PREFLIGHT to start the engine without a local
  camera (NDI/RTSP/synthetic-only or headless runs).
- tests/test_synthetic_source.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@TCVinNYC TCVinNYC merged commit 418bf66 into main Jun 26, 2026
3 checks passed