perf(inference): eliminate ORT thread-pool spin-wait + synthetic test source #120
Merged
…ace pools

The dominant CPU consumer with multiple cameras was ORT thread-pool spin-wait, not inference compute. A profiler (macOS sample) of a live 4-camera session showed ThreadPoolTempl::WorkerLoop as the #1 leaf (10105 samples) with 88 ORT worker threads. Each session's intra-op pool busy-spins ~200ms before parking, so several intermittently-run sessions (detect every Nth frame, pose/face a few Hz) turn into a wall of idle CPU.

Fixes:
- inference.py: _apply_low_idle_threading() disables intra/inter-op spinning and forces a single sequential inter-op pool on every make_session() session (detector + pose). Pure scheduling change; no effect on results.
- identify.py: insightface's get_model() forwards only providers (no SessionOptions), so its SCRFD+ArcFace sessions ran cores-wide + spinning, bypassing the cap. _capped_insightface_sessions() scopes a patch of InferenceSession.__init__ to inject a capped, non-spinning SessionOptions during FaceAnalysis construction, guarded by a module lock (the degraded per-worker face-build path is not otherwise serialised across camera threads).

Headless A/B (4 cams, yolo11m, all services): inference pool 514% -> 273% CPU. Live-GUI profile confirms WorkerLoop eliminated from the hotspots.

Test/scaling infra (enables headless multi-camera CPU validation, no camera perm):
- ingest.py: SyntheticAdapter, a procedural / image / video synthetic source that pans the scene so detection/tracking/ego all run; effective-fps telemetry.
- frame_source.py + models.py: wire source type "synthetic".
- store.py: AUTOPTZ_DB_PATH override to run against an isolated profile.
- app.py: AUTOPTZ_SKIP_CAMERA_PREFLIGHT to start the engine without a local camera (NDI/RTSP/synthetic-only or headless runs).
- tests/test_synthetic_source.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
With multiple cameras the app pegged CPU (~930% / 66% system on 4 real cameras) with bursty variance and frame drops. Profiling the live session (macOS `sample`) showed the #1 consumer was ORT thread-pool spin-wait (`ThreadPoolTempl::WorkerLoop`, 10,105 leaf samples; 88 ORT worker threads) — threads busy-spinning between intermittent inference runs, not actual compute. insightface's SCRFD+ArcFace sessions made it worse: `get_model()` forwards only `providers`, so they ran cores-wide + spinning, bypassing the per-camera thread cap.
Fix
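As a rough illustration of the two changes the commit message attributes to `inference.py` and `identify.py`, the sketch below shows one way the pattern could look. It is not the PR's code: `apply_low_idle_threading`, `scoped_init_kwarg`, and `_PATCH_LOCK` are hypothetical names, the options object is duck-typed so the snippet runs without onnxruntime installed, and the patch assumes the wrapped library passes its session arguments by keyword. The config keys `session.intra_op.allow_spinning` and `session.inter_op.allow_spinning` are the ONNX Runtime session config entries that control spinning.

```python
import threading
from contextlib import contextmanager

# Hypothetical module-level lock: serialises the temporary patch across threads.
_PATCH_LOCK = threading.Lock()


def apply_low_idle_threading(opts, intra_threads):
    """Cap threads and disable spin-wait on a SessionOptions-like object.

    `opts` is expected to behave like onnxruntime.SessionOptions.
    """
    opts.intra_op_num_threads = intra_threads
    opts.inter_op_num_threads = 1  # a single inter-op thread
    # Park idle workers instead of letting them busy-spin between runs.
    opts.add_session_config_entry("session.intra_op.allow_spinning", "0")
    opts.add_session_config_entry("session.inter_op.allow_spinning", "0")
    return opts


@contextmanager
def scoped_init_kwarg(cls, name, make_value):
    """Temporarily patch cls.__init__ so a missing keyword argument is injected.

    Assumption: callers pass `name` by keyword or not at all.
    """
    original = cls.__init__

    def patched(self, *args, **kwargs):
        if kwargs.get(name) is None:
            kwargs[name] = make_value()
        original(self, *args, **kwargs)

    with _PATCH_LOCK:
        cls.__init__ = patched
        try:
            yield
        finally:
            # Always restore, even if construction inside the block raises.
            cls.__init__ = original
```

With onnxruntime available, usage might look like `with scoped_init_kwarg(ort.InferenceSession, "sess_options", lambda: apply_low_idle_threading(ort.SessionOptions(), 2)): ...`, with the third-party model construction placed inside the `with` block. Setting `opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL` alongside this would make the sequential mode explicit. Holding the lock for the whole block keeps two threads from interleaving patch and restore, at the cost of serialising session construction.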
Results (validated)
A 2.3× reduction, under the 30% system target, with the CPU variance gone. Live re-profile confirms `WorkerLoop` is eliminated from the hotspots.
Test / scaling infrastructure (enables headless multi-camera CPU validation, no camera permission)
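To give a feel for what a procedural synthetic source that pans the scene might involve, here is a minimal, standard-library-only sketch. It is an assumption-laden stand-in, not the `SyntheticAdapter` from `ingest.py`: the class name `PanningSource`, its fields, and the grayscale `bytes` frame format are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class PanningSource:
    """Hypothetical synthetic camera: a window panning over a wider scene."""

    width: int = 64
    height: int = 48
    scene_width: int = 256  # wider than the frame so the view can pan
    step: int = 4           # pixels of pan per frame

    @staticmethod
    def _scene_pixel(x: int) -> int:
        # Procedural scene: bright 8-px vertical bars every 32 px on a dark field.
        return 255 if (x % 32) < 8 else 16

    def frames(self, count: int) -> Iterator[bytes]:
        """Yield `count` grayscale frames (width * height bytes each)."""
        span = self.scene_width - self.width
        for i in range(count):
            # Triangle wave: pan right to the scene edge, then back left.
            phase = (i * self.step) % (2 * span)
            offset = phase if phase <= span else 2 * span - phase
            row = bytes(self._scene_pixel(offset + x) for x in range(self.width))
            yield row * self.height  # every row is identical in this scene
```

Because the content shifts from frame to frame, downstream motion-dependent stages have something to react to, which is the property the PR attributes to its synthetic source. A real adapter would also need pacing and frame-rate reporting, which this sketch leaves out.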
All 1336 tests pass; ruff + mypy + selftest green.
🤖 Generated with Claude Code