Commit 489698e: "deepseek reliability hardening"
1 parent d143e60

10 files changed, 2809 additions & 201 deletions

docs/architecture/artifact_layout_and_stage_handoffs.md

Lines changed: 26 additions & 0 deletions
@@ -92,6 +92,32 @@ That affects:

Chunk suffix behavior is therefore part of the current contract.

For DeepSeek OCR, there is an important distinction between execution-time shards and stage handoff artifacts:

- Multi-GPU `exact_fill` may execute shards such as `doc__p00001-00096` internally to keep GPU lanes full.
- Those shard names are operational artifacts, not the downstream contract for OCR outputs.
- After worker completion, the runner reassembles canonical `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` files for each source PDF.
- Canonical OCR markdown page boundaries are annotated with `<!-- page:N -->` comments next to the page-split marker, and the parser remains backward compatible with legacy unnumbered separators.
- Original shard markdown and shard metrics are moved under `sidecars/ocr_shards/` for debugging and audit trails.
- If a repair retry trips the garbage cutoff again, the canonical markdown keeps the page slot but blanks the page content rather than preserving the bad first-pass OCR.

For multi-GPU vLLM OCR, there is now a second class of operational artifacts under `sidecars/ocr_runtime/`:

- `work_queue.sqlite`: durable batch queue state for the current OCR run
- `worker_*.runtime.json`: per-worker heartbeat and timing state
- `gpu_preflight.json`: GPU readiness checks such as persistence mode
- `gpu_telemetry.jsonl`: sampled GPU utilization and process telemetry
- `runtime_summary.json`: queue completion state plus steady-state timing windows

The runtime queue now has two phases inside the same operational state:

- first-pass shard batches
- repair shard batches published after first pass completes

These runtime artifacts are operational state, not downstream stage inputs. They are intended for monitoring, debugging, and safe resumption logic.

Downstream stages should therefore consume canonical OCR outputs, not shard artifacts.
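The page-boundary contract above can be consumed with a small parser. A minimal sketch, assuming only what the docs state (`<!-- page:N -->` annotations, with a fallback for legacy unnumbered separators); the legacy separator token used here is a placeholder, not the runner's actual marker:

```python
import re

# Matches the documented page annotation, e.g. "<!-- page:3 -->".
PAGE_TAG = re.compile(r"<!--\s*page:(\d+)\s*-->")

def split_pages(markdown: str, legacy_separator: str = "\n---\n"):
    """Split canonical OCR markdown into (page_number, text) pairs.

    Prefers the numbered `<!-- page:N -->` annotations; falls back to a
    legacy unnumbered separator (placeholder token) with sequential numbering.
    """
    tags = list(PAGE_TAG.finditer(markdown))
    if tags:
        pages = []
        for i, tag in enumerate(tags):
            start = tag.end()
            end = tags[i + 1].start() if i + 1 < len(tags) else len(markdown)
            pages.append((int(tag.group(1)), markdown[start:end].strip()))
        return pages
    # Legacy documents carry no page numbers, so number pages sequentially.
    chunks = markdown.split(legacy_separator)
    return [(i + 1, chunk.strip()) for i, chunk in enumerate(chunks)]
```

A blanked repair page simply yields an empty string for its slot, so page numbering stays stable.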
## Authoritative state vs derived artifacts
Not every file has equal semantic importance.

docs/multi_gpu.md

Lines changed: 29 additions & 0 deletions
@@ -32,6 +32,35 @@ c.ocr(use_gpus='multi', math_batch_size=12)

- Crashed workers are respawned automatically; control the retry budget per GPU with `GLOSSAPI_MATH_RESPAWN_CAP` (default `5`). Use `GLOSSAPI_WORKER_LOG_VERBOSE=0` to silence the banner that prints the binding info.
- When a device exceeds the respawn cap, remaining stems are added to the fatal skip-list and their artifacts are quarantined under `downloads/problematic_math/` and `json/problematic_math/` for follow-up.

## DeepSeek OCR on Multiple GPUs

```python
from glossapi import Corpus

c = Corpus("OUT", "OUT")
c.ocr(
    use_gpus="multi",
    runtime_backend="vllm",
    workers_per_gpu=1,
    scheduler="exact_fill",
    target_batch_pages=96,
)
```

- `scheduler="exact_fill"` is the preferred multi-GPU vLLM scheduler when PDFs vary widely in length. It shards large documents into page ranges and keeps GPU lanes filled more evenly.
- Internal shard runs preserve the public `Corpus.ocr()` contract: canonical outputs are reassembled into `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` for each source PDF.
- Shard markdown and shard metrics are retained for debugging under `sidecars/ocr_shards/` instead of remaining in the canonical handoff directories.
- The vLLM path renders pages into memory and feeds a bounded queue directly into inference, which removes the temporary PNG round-trip and overlaps rendering with generation.
- Empty-page detection still happens before inference, and repair retries reuse the in-memory page image instead of reopening a file from disk.
- Final OCR markdown tags each page split with `<!-- page:N -->` so page images, markdown, and metrics stay aligned during inspection.
- If a repair retry hits the garbage cutoff again, the page is blanked rather than keeping the failed first-pass garbage.
- Multi-GPU vLLM workers pull from a durable shared batch queue in `sidecars/ocr_runtime/work_queue.sqlite`, so finished batches survive worker crashes and respawned workers can continue without rescanning completed work.
- Repair work runs as a second global queue phase. First-pass batches finish and persist shard outputs first; then any worker can claim the queued repair shards. This keeps repair tails balanced across GPUs without mixing worker-local repair state into the controller.
- Each worker writes `sidecars/ocr_runtime/worker_*.runtime.json` with heartbeat state and steady-state timing markers. The runner also emits `gpu_preflight.json`, `gpu_telemetry.jsonl`, and `runtime_summary.json`.
- The runner checks GPU persistence mode before launch by default. Control it with `GLOSSAPI_DEEPSEEK_GPU_PREFLIGHT=off|warn|ensure`. The default is `ensure`, which tries `sudo -n nvidia-smi -pm 1` and records the result in `gpu_preflight.json`.
- Worker reliability knobs are environment-driven: `GLOSSAPI_DEEPSEEK_WORKER_RESPAWN_CAP`, `GLOSSAPI_DEEPSEEK_WORK_ITEM_MAX_ATTEMPTS`, `GLOSSAPI_DEEPSEEK_WORK_STALE_AFTER_SEC`, `GLOSSAPI_DEEPSEEK_WORK_HEARTBEAT_SEC`, and `GLOSSAPI_DEEPSEEK_TELEMETRY_INTERVAL_SEC`.
- The default `GLOSSAPI_DEEPSEEK_WORK_ITEM_MAX_ATTEMPTS=2` means one retry after the first failed claim; after that, the batch is marked failed instead of retrying forever.
- `workers_per_gpu=1` remains the safe default on A100 40GB nodes. Prefer increasing `target_batch_pages` before adding more workers per device.
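The claim-and-retry semantics described above can be sketched against SQLite. This is a hypothetical minimal schema, not the actual `work_queue.sqlite` layout; `claim_batch` only illustrates atomic claiming plus the two-attempt ceiling:

```python
import sqlite3
import time

# Hypothetical schema sketch; the real work_queue.sqlite layout is an assumption.
SCHEMA = """
CREATE TABLE IF NOT EXISTS batches (
    id INTEGER PRIMARY KEY,
    phase TEXT NOT NULL,                    -- 'main' or 'repair'
    state TEXT NOT NULL DEFAULT 'pending',  -- pending / claimed / done / failed
    attempts INTEGER NOT NULL DEFAULT 0,
    heartbeat REAL
)
"""

def claim_batch(conn: sqlite3.Connection, phase: str, max_attempts: int = 2):
    """Claim one pending batch in a single transaction, enforcing the ceiling.

    Returns the claimed batch id, or None when nothing is claimable. A batch
    that would exceed max_attempts is marked 'failed' for operator follow-up.
    """
    with conn:  # one transaction: the claim survives or rolls back atomically
        row = conn.execute(
            "SELECT id, attempts FROM batches "
            "WHERE phase = ? AND state = 'pending' LIMIT 1",
            (phase,),
        ).fetchone()
        if row is None:
            return None
        batch_id, attempts = row
        if attempts + 1 > max_attempts:
            conn.execute(
                "UPDATE batches SET state = 'failed' WHERE id = ?", (batch_id,)
            )
            return None
        conn.execute(
            "UPDATE batches SET state = 'claimed', attempts = attempts + 1, "
            "heartbeat = ? WHERE id = ?",
            (time.time(), batch_id),
        )
        return batch_id
```

A stale-heartbeat reaper (not shown) would move crashed `claimed` batches back to `pending` with their attempt count preserved, which is what makes the ceiling meaningful.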
## Provider & Device Checks

- ONNXRuntime providers must include `CUDAExecutionProvider`.
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@

# DELETE ME: DeepSeek Reliability Pending Work

This note is temporary. Delete it after the first production soak confirms the merged reliability path is stable and the follow-up items below are either done or explicitly discarded.

## What shipped in this merge

- durable multi-GPU DeepSeek work queue with separate main and repair phases
- worker respawn with process-group teardown so orphaned `VLLM::EngineCore` processes do not pin VRAM after a crash
- GPU preflight and telemetry sidecars under `sidecars/ocr_runtime/`
- steady-state timing in the runtime summary
- default work-item retry ceiling of two total attempts
  - first failure: retry once
  - second failure: mark the batch failed and stop retrying it

## Pending follow-up

1. Capture and archive one clean fault-injection receipt on the merged `development` branch.
   - Goal: preserve one explicit production-like run where a worker is killed mid-run, the supervisor respawns it, the in-flight batch is retried once, and the run still completes.

2. Add operator-facing handling for terminally failed batches.
   - The durable queue already marks them `failed`.
   - The remaining work is a cleaner operator handoff, for example a dedicated quarantine/export path or a documented replay workflow.

3. Replace the current image-content stats implementation in `run_pdf_ocr_vllm.py`.
   - It still uses a CPU-heavy PIL pixel scan and currently emits a Pillow deprecation warning.

4. Run a longer unattended soak after merge.
   - The current validation covers targeted tests, full end-to-end runs, and reliability-path implementation, but production confidence still benefits from a longer multi-hour burn-in on the merged branch.

docs/stages/ocr.md

Lines changed: 13 additions & 0 deletions
@@ -41,6 +41,19 @@ OCR reruns should preserve:

- explicit indication that remediation was attempted
- visibility into files that remain problematic

## DeepSeek runtime contract

- `ocr()` may execute page-range shards internally when `use_gpus="multi"` and `scheduler="exact_fill"`, but the stage contract remains one canonical Markdown file and one canonical metrics file per source PDF.
- When shard execution is used, the runner reassembles `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` after the CLI workers finish.
- Execution-time shard artifacts are moved under `sidecars/ocr_shards/` so downstream stages do not mistake them for canonical stage outputs.
- The vLLM runtime streams rendered pages through an in-memory queue, overlaps rendering with inference, skips empty pages before inference, and reuses the same in-memory image for repair retries.
- Canonical OCR markdown annotates page boundaries with `<!-- page:N -->` comments alongside each page-split marker so downstream inspection can line up page images and markdown more easily.
- In `repair_mode="auto"`, a page that trips the garbage cutoff again during the plain-OCR repair pass is blanked instead of keeping the original garbage text.
- Multi-GPU vLLM runs execute through a durable shared batch queue rather than one fragile subprocess per preassigned lane. Workers claim first-pass batches dynamically, heartbeat while a batch is active, and can be respawned without losing finished batch outputs.
- Repair retries are durable too. Flagged pages are published back into the same runtime database as a second global repair queue, and any GPU worker can drain those repair shards after the first-pass queue is complete.
- By default each durable batch gets at most two total attempts: one retry is allowed after the first failure, then the batch is marked failed for operator follow-up.
- Operational sidecars for these runs live under `sidecars/ocr_runtime/`, including the durable work queue state, per-worker runtime JSON, GPU telemetry samples, GPU preflight output, and a final runtime summary with steady-state inference timestamps.
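The render/inference overlap through a bounded in-memory queue can be sketched as below; `render` and `infer` are placeholders for the real rasterizer and vLLM call, and the bound keeps rendered pages from piling up in RAM:

```python
import queue
import threading

def run_pipeline(pages, render, infer, bound=8):
    """Overlap page rendering with inference via a bounded queue.

    A producer thread renders pages ahead of the consumer; q.put blocks
    once `bound` rendered pages are waiting, so memory use stays capped.
    """
    q = queue.Queue(maxsize=bound)
    SENTINEL = object()  # marks end of the page stream

    def producer():
        for page in pages:
            q.put(render(page))  # blocks when the queue is full
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (item := q.get()) is not SENTINEL:
        results.append(infer(item))
    return results
```

The same in-memory item can be kept around for a repair retry, which is what removes the PNG round-trip described above.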
## Contributor note

Any change to candidate selection, skiplist semantics, or OCR-success metadata affects both rerun behavior and corpus analysis quality.
