Commit 489698e: "deepseek reliability hardening"
1 parent d143e60

10 files changed, 2809 additions & 201 deletions

docs/architecture/artifact_layout_and_stage_handoffs.md

Lines changed: 26 additions & 0 deletions
@@ -92,6 +92,32 @@ That affects:

Chunk suffix behavior is therefore part of the current contract.

For DeepSeek OCR, there is an important distinction between execution-time shards and stage handoff artifacts:

- Multi-GPU `exact_fill` may execute shards such as `doc__p00001-00096` internally to keep GPU lanes full.
- Those shard names are operational artifacts, not the downstream contract for OCR outputs.
- After worker completion, the runner reassembles canonical `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` files for each source PDF.
- Canonical OCR markdown page boundaries are annotated with `<!-- page:N -->` comments next to the page-split marker, and the parser remains backward compatible with legacy unnumbered separators.
- Original shard markdown and shard metrics are moved under `sidecars/ocr_shards/` for debugging and audit trails.
- If a repair retry trips the garbage cutoff again, the canonical markdown keeps the page slot but blanks the page content rather than preserving the bad first-pass OCR.

For multi-GPU vLLM OCR, there is now a second class of operational artifacts under `sidecars/ocr_runtime/`:

- `work_queue.sqlite`: durable batch queue state for the current OCR run
- `worker_*.runtime.json`: per-worker heartbeat and timing state
- `gpu_preflight.json`: GPU readiness checks such as persistence mode
- `gpu_telemetry.jsonl`: sampled GPU utilization and process telemetry
- `runtime_summary.json`: queue completion state plus steady-state timing windows

The runtime queue now has two phases inside the same operational state:

- first-pass shard batches
- repair shard batches published after first pass completes

These runtime artifacts are operational state, not downstream stage inputs. They are intended for monitoring, debugging, and safe resumption logic.

Downstream stages should therefore consume canonical OCR outputs, not shard artifacts.
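The page-boundary contract above can be consumed with a small parser. A minimal sketch, assuming only what the docs state (`<!-- page:N -->` annotations, with a fallback for legacy unnumbered separators); the legacy separator token used here is a placeholder, not the runner's actual marker:

```python
import re

# Matches the documented page annotation, e.g. "<!-- page:3 -->".
PAGE_TAG = re.compile(r"<!--\s*page:(\d+)\s*-->")

def split_pages(markdown: str, legacy_separator: str = "\n---\n"):
    """Split canonical OCR markdown into (page_number, text) pairs.

    Prefers the numbered `<!-- page:N -->` annotations; falls back to a
    legacy unnumbered separator (placeholder token) with sequential numbering.
    """
    tags = list(PAGE_TAG.finditer(markdown))
    if tags:
        pages = []
        for i, tag in enumerate(tags):
            start = tag.end()
            end = tags[i + 1].start() if i + 1 < len(tags) else len(markdown)
            pages.append((int(tag.group(1)), markdown[start:end].strip()))
        return pages
    # Legacy documents carry no page numbers, so number pages sequentially.
    chunks = markdown.split(legacy_separator)
    return [(i + 1, chunk.strip()) for i, chunk in enumerate(chunks)]
```

A blanked repair page simply yields an empty string for its slot, so page numbering stays stable.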
## Authoritative state vs derived artifacts
Not every file has equal semantic importance.

docs/multi_gpu.md

Lines changed: 29 additions & 0 deletions
@@ -32,6 +32,35 @@ c.ocr(use_gpus='multi', math_batch_size=12)

- Crashed workers are respawned automatically; control the retry budget per GPU with `GLOSSAPI_MATH_RESPAWN_CAP` (default `5`). Use `GLOSSAPI_WORKER_LOG_VERBOSE=0` to silence the banner that prints the binding info.
- When a device exceeds the respawn cap, remaining stems are added to the fatal skip-list and their artifacts are quarantined under `downloads/problematic_math/` and `json/problematic_math/` for follow-up.

## DeepSeek OCR on Multiple GPUs

```python
from glossapi import Corpus

c = Corpus("OUT", "OUT")
c.ocr(
    use_gpus="multi",
    runtime_backend="vllm",
    workers_per_gpu=1,
    scheduler="exact_fill",
    target_batch_pages=96,
)
```

- `scheduler="exact_fill"` is the preferred multi-GPU vLLM scheduler when PDFs vary widely in length. It shards large documents into page ranges and keeps GPU lanes filled more evenly.
- Internal shard runs preserve the public `Corpus.ocr()` contract: canonical outputs are reassembled into `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` for each source PDF.
- Shard markdown and shard metrics are retained for debugging under `sidecars/ocr_shards/` instead of remaining in the canonical handoff directories.
- The vLLM path renders pages into memory and feeds a bounded queue directly into inference, which removes the temporary PNG round-trip and overlaps rendering with generation.
- Empty-page detection still happens before inference, and repair retries reuse the in-memory page image instead of reopening a file from disk.
- Final OCR markdown tags each page split with `<!-- page:N -->` so page images, markdown, and metrics stay aligned during inspection.
- If a repair retry hits the garbage cutoff again, the page is blanked rather than keeping the failed first-pass garbage.
- Multi-GPU vLLM workers pull from a durable shared batch queue in `sidecars/ocr_runtime/work_queue.sqlite`, so finished batches survive worker crashes and respawned workers can continue without rescanning completed work.
- Repair work runs as a second global queue phase. First-pass batches finish and persist shard outputs first; then any worker can claim the queued repair shards. This keeps repair tails balanced across GPUs without mixing worker-local repair state into the controller.
- Each worker writes `sidecars/ocr_runtime/worker_*.runtime.json` with heartbeat state and steady-state timing markers. The runner also emits `gpu_preflight.json`, `gpu_telemetry.jsonl`, and `runtime_summary.json`.
- The runner checks GPU persistence mode before launch by default. Control it with `GLOSSAPI_DEEPSEEK_GPU_PREFLIGHT=off|warn|ensure`. The default is `ensure`, which tries `sudo -n nvidia-smi -pm 1` and records the result in `gpu_preflight.json`.
- Worker reliability knobs are environment-driven: `GLOSSAPI_DEEPSEEK_WORKER_RESPAWN_CAP`, `GLOSSAPI_DEEPSEEK_WORK_ITEM_MAX_ATTEMPTS`, `GLOSSAPI_DEEPSEEK_WORK_STALE_AFTER_SEC`, `GLOSSAPI_DEEPSEEK_WORK_HEARTBEAT_SEC`, and `GLOSSAPI_DEEPSEEK_TELEMETRY_INTERVAL_SEC`.
- The default `GLOSSAPI_DEEPSEEK_WORK_ITEM_MAX_ATTEMPTS=2` means one retry after the first failed claim; after that, the batch is marked failed instead of retrying forever.
- `workers_per_gpu=1` remains the safe default on A100 40GB nodes. Prefer increasing `target_batch_pages` before adding more workers per device.
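The claim-and-retry semantics described above can be sketched against SQLite. This is a hypothetical minimal schema, not the actual `work_queue.sqlite` layout; `claim_batch` only illustrates atomic claiming plus the two-attempt ceiling:

```python
import sqlite3
import time

# Hypothetical schema sketch; the real work_queue.sqlite layout is an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS batches (
    id INTEGER PRIMARY KEY,
    phase TEXT NOT NULL,                    -- 'main' or 'repair'
    state TEXT NOT NULL DEFAULT 'pending',  -- pending / claimed / done / failed
    attempts INTEGER NOT NULL DEFAULT 0,
    heartbeat REAL
)
"""

def claim_batch(conn: sqlite3.Connection, phase: str, max_attempts: int = 2):
    """Claim one pending batch in a single transaction, enforcing the ceiling.

    Returns the claimed batch id, or None when nothing is claimable. A batch
    that would exceed max_attempts is marked 'failed' for operator follow-up.
    """
    with conn:  # one transaction: the claim survives or rolls back atomically
        row = conn.execute(
            "SELECT id, attempts FROM batches "
            "WHERE phase = ? AND state = 'pending' LIMIT 1",
            (phase,),
        ).fetchone()
        if row is None:
            return None
        batch_id, attempts = row
        if attempts + 1 > max_attempts:
            conn.execute(
                "UPDATE batches SET state = 'failed' WHERE id = ?", (batch_id,)
            )
            return None
        conn.execute(
            "UPDATE batches SET state = 'claimed', attempts = attempts + 1, "
            "heartbeat = ? WHERE id = ?",
            (time.time(), batch_id),
        )
        return batch_id
```

A stale-heartbeat reaper (not shown) would move crashed `claimed` batches back to `pending` with their attempt count preserved, which is what makes the ceiling meaningful.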
## Provider & Device Checks

- ONNXRuntime providers must include `CUDAExecutionProvider`.
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@

# DELETE ME: DeepSeek Reliability Pending Work

This note is temporary. Delete it after the first production soak confirms the merged reliability path is stable and the follow-up items below are either done or explicitly discarded.

## What shipped in this merge

- durable multi-GPU DeepSeek work queue with separate main and repair phases
- worker respawn with process-group teardown so orphaned `VLLM::EngineCore` processes do not pin VRAM after a crash
- GPU preflight and telemetry sidecars under `sidecars/ocr_runtime/`
- steady-state timing in the runtime summary
- default work-item retry ceiling of two total attempts
  - first failure: retry once
  - second failure: mark the batch failed and stop retrying it

## Pending follow-up

1. Capture and archive one clean fault-injection receipt on the merged `development` branch.
   - Goal: preserve one explicit production-like run where a worker is killed mid-run, the supervisor respawns it, the in-flight batch is retried once, and the run still completes.

2. Add operator-facing handling for terminally failed batches.
   - The durable queue already marks them `failed`.
   - The remaining work is a cleaner operator handoff, for example a dedicated quarantine/export path or a documented replay workflow.

3. Replace the current image-content stats implementation in `run_pdf_ocr_vllm.py`.
   - It still uses a CPU-heavy PIL pixel scan and currently emits a Pillow deprecation warning.

4. Run a longer unattended soak after merge.
   - The current validation covers targeted tests, full end-to-end runs, and reliability-path implementation, but production confidence still benefits from a longer multi-hour burn-in on the merged branch.

docs/stages/ocr.md

Lines changed: 13 additions & 0 deletions
@@ -41,6 +41,19 @@ OCR reruns should preserve:

- explicit indication that remediation was attempted
- visibility into files that remain problematic

## DeepSeek runtime contract

- `ocr()` may execute page-range shards internally when `use_gpus="multi"` and `scheduler="exact_fill"`, but the stage contract remains one canonical Markdown file and one canonical metrics file per source PDF.
- When shard execution is used, the runner reassembles `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` after the CLI workers finish.
- Execution-time shard artifacts are moved under `sidecars/ocr_shards/` so downstream stages do not mistake them for canonical stage outputs.
- The vLLM runtime streams rendered pages through an in-memory queue, overlaps rendering with inference, skips empty pages before inference, and reuses the same in-memory image for repair retries.
- Canonical OCR markdown annotates page boundaries with `<!-- page:N -->` comments alongside each page-split marker so downstream inspection can line up page images and markdown more easily.
- In `repair_mode="auto"`, a page that trips the garbage cutoff again during the plain-OCR repair pass is blanked instead of keeping the original garbage text.
- Multi-GPU vLLM runs execute through a durable shared batch queue rather than one fragile subprocess per preassigned lane. Workers claim first-pass batches dynamically, heartbeat while a batch is active, and can be respawned without losing finished batch outputs.
- Repair retries are durable too. Flagged pages are published back into the same runtime database as a second global repair queue, and any GPU worker can drain those repair shards after the first-pass queue is complete.
- By default each durable batch gets at most two total attempts: one retry is allowed after the first failure, then the batch is marked failed for operator follow-up.
- Operational sidecars for these runs live under `sidecars/ocr_runtime/`, including the durable work queue state, per-worker runtime JSON, GPU telemetry samples, GPU preflight output, and a final runtime summary with steady-state inference timestamps.
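The render/inference overlap through a bounded in-memory queue can be sketched as below; `render` and `infer` are placeholders for the real rasterizer and vLLM call, and the bound keeps rendered pages from piling up in RAM:

```python
import queue
import threading

def run_pipeline(pages, render, infer, bound=8):
    """Overlap page rendering with inference via a bounded queue.

    A producer thread renders pages ahead of the consumer; q.put blocks
    once `bound` rendered pages are waiting, so memory use stays capped.
    """
    q = queue.Queue(maxsize=bound)
    SENTINEL = object()  # marks end of the page stream

    def producer():
        for page in pages:
            q.put(render(page))  # blocks when the queue is full
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := q.get()) is not SENTINEL:
        results.append(infer(item))
    return results
```

The same in-memory item can be kept around for a repair retry, which is what removes the PNG round-trip described above.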
## Contributor note

Any change to candidate selection, skiplist semantics, or OCR-success metadata affects both rerun behavior and corpus analysis quality.
