docs/architecture/artifact_layout_and_stage_handoffs.md

Chunk suffix behavior is therefore part of the current contract.

For DeepSeek OCR, there is an important distinction between execution-time shards and stage handoff artifacts:

- Multi-GPU `exact_fill` may execute shards such as `doc__p00001-00096` internally to keep GPU lanes full.
- Those shard names are operational artifacts, not the downstream contract for OCR outputs.
- After worker completion, the runner reassembles canonical `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` files for each source PDF.
- Canonical OCR markdown page boundaries are annotated with `<!-- page:N -->` comments next to the page-split marker, and the parser remains backward-compatible with legacy unnumbered separators.
- Original shard markdown and shard metrics are moved under `sidecars/ocr_shards/` for debugging and audit trails.
- If a repair retry trips the garbage cutoff again, the canonical markdown keeps the page slot but blanks the page content rather than preserving the bad first-pass OCR.
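
As a sketch of how such page-range shards could be derived, assuming only the `stem__pNNNNN-NNNNN` naming shown above (the helper name `shard_pages` and the five-digit padding are illustrative, not the scheduler's actual implementation):

```python
# Illustrative exact_fill-style sharding: split each document into page
# ranges of at most `target_batch_pages`, naming shards like the
# "doc__p00001-00096" example above. Helper name and padding are
# assumptions of this sketch, not library code.
def shard_pages(page_counts: dict[str, int], target_batch_pages: int = 96) -> list[str]:
    shards: list[str] = []
    for stem, pages in page_counts.items():
        for start in range(1, pages + 1, target_batch_pages):
            end = min(start + target_batch_pages - 1, pages)
            shards.append(f"{stem}__p{start:05d}-{end:05d}")
    return shards
```

A 100-page `doc` would yield one full 96-page shard plus a 4-page tail shard, while a document shorter than the target stays in a single shard.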

For multi-GPU vLLM OCR, there is now a second class of operational artifacts under `sidecars/ocr_runtime/`:

- `work_queue.sqlite`: durable batch queue state for the current OCR run
- `worker_*.runtime.json`: per-worker heartbeat and timing state
- `gpu_preflight.json`: GPU readiness checks such as persistence mode
- `gpu_telemetry.jsonl`: sampled GPU utilization and process telemetry
- `runtime_summary.json`: queue completion state plus steady-state timing windows
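
A monitoring script could load these sidecars roughly as follows. The file names come from the layout above; the JSON schemas are not specified here, so this sketch only parses and collects them generically:

```python
import json
from pathlib import Path

# Illustrative monitoring helper for the ocr_runtime sidecars listed above.
# Only the file names are taken from the docs; the schemas are unspecified,
# so everything is returned as parsed-but-uninterpreted JSON.
def load_runtime_state(run_dir: str) -> dict:
    runtime = Path(run_dir) / "sidecars" / "ocr_runtime"
    state: dict = {}
    summary = runtime / "runtime_summary.json"
    if summary.exists():
        state["summary"] = json.loads(summary.read_text())
    telemetry = runtime / "gpu_telemetry.jsonl"
    if telemetry.exists():
        # JSON Lines: one telemetry sample per line
        state["telemetry"] = [
            json.loads(line)
            for line in telemetry.read_text().splitlines()
            if line.strip()
        ]
    state["workers"] = sorted(p.name for p in runtime.glob("worker_*.runtime.json"))
    return state
```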

The runtime queue now has two phases inside the same operational state:

- first-pass shard batches
- repair shard batches published after first pass completes

These runtime artifacts are operational state, not downstream stage inputs. They are intended for monitoring, debugging, and safe resumption logic.

Downstream stages should therefore consume canonical OCR outputs, not shard artifacts.
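
The `<!-- page:N -->` annotations described above can be consumed with a small parser. This sketch keys only on the comment format itself, so it naturally returns an empty list for legacy output without numbered tags:

```python
import re

# Illustrative reader for the page annotations in canonical OCR markdown.
# Only the "<!-- page:N -->" comment format is taken from the docs; the
# surrounding page-split marker is not specified here, so this keys on
# the comment alone.
PAGE_TAG = re.compile(r"<!--\s*page:(\d+)\s*-->")

def annotated_pages(markdown: str) -> list[int]:
    """Return the page numbers tagged in a canonical OCR markdown file."""
    return [int(n) for n in PAGE_TAG.findall(markdown)]
```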
- Crashed workers are respawned automatically; control the retry budget per GPU with `GLOSSAPI_MATH_RESPAWN_CAP` (default `5`). Use `GLOSSAPI_WORKER_LOG_VERBOSE=0` to silence the banner that prints the binding info.
- When a device exceeds the respawn cap, remaining stems are added to the fatal skip-list and their artifacts are quarantined under `downloads/problematic_math/` and `json/problematic_math/` for follow-up.
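
Environment knobs like the respawn cap above can be read with a small helper; only `GLOSSAPI_MATH_RESPawn_CAP`'s documented default of `5` is grounded here, and the helper itself is a sketch, not the library's parsing code:

```python
import os

# Illustrative env-knob reader. The default of 5 for
# GLOSSAPI_MATH_RESPAWN_CAP comes from the docs above; the helper
# name and empty-string handling are assumptions of this sketch.
def int_knob(name: str, default: int) -> int:
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default

respawn_cap = int_knob("GLOSSAPI_MATH_RESPAWN_CAP", 5)
```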

## DeepSeek OCR on Multiple GPUs

```python
from glossapi import Corpus

c = Corpus("OUT", "OUT")
c.ocr(
    use_gpus="multi",
    runtime_backend="vllm",
    workers_per_gpu=1,
    scheduler="exact_fill",
    target_batch_pages=96,
)
```

- `scheduler="exact_fill"` is the preferred multi-GPU vLLM scheduler when PDFs vary widely in length. It shards large documents into page ranges and keeps GPU lanes filled more evenly.
- Internal shard runs now preserve the public `Corpus.ocr()` contract. Canonical outputs are reassembled back into `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` for each source PDF.
- Shard markdown and shard metrics are retained for debugging under `sidecars/ocr_shards/` instead of remaining in the canonical handoff directories.
- The vLLM path now renders pages into memory and feeds a bounded queue directly into inference, which removes the temporary PNG round-trip and overlaps rendering with generation.
- Empty-page detection still happens before inference, and repair retries reuse the in-memory page image instead of reopening a file from disk.
- Final OCR markdown now tags each page split with `<!-- page:N -->` so page images, markdown, and metrics stay aligned during inspection.
- If a repair retry hits the garbage cutoff again, the page is blanked rather than keeping the failed first-pass garbage.
- Multi-GPU vLLM workers now pull from a durable shared batch queue in `sidecars/ocr_runtime/work_queue.sqlite`, so finished batches survive worker crashes and respawned workers can continue without rescanning completed work.
- Repair work now runs as a second global queue phase. First-pass batches finish and persist shard outputs first; then any worker can claim the queued repair shards. This keeps repair tails balanced across GPUs without mixing worker-local repair state into the controller.
- Each worker writes `sidecars/ocr_runtime/worker_*.runtime.json` with heartbeat state and steady-state timing markers. The runner also emits `gpu_preflight.json`, `gpu_telemetry.jsonl`, and `runtime_summary.json`.
- The runner checks GPU persistence mode before launch by default. Control it with `GLOSSAPI_DEEPSEEK_GPU_PREFLIGHT=off|warn|ensure`. The default is `ensure`, which will try `sudo -n nvidia-smi -pm 1` and record the result in `gpu_preflight.json`.
- Worker reliability knobs are environment-driven: `GLOSSAPI_DEEPSEEK_WORKER_RESPAWN_CAP`, `GLOSSAPI_DEEPSEEK_WORK_ITEM_MAX_ATTEMPTS`, `GLOSSAPI_DEEPSEEK_WORK_STALE_AFTER_SEC`, `GLOSSAPI_DEEPSEEK_WORK_HEARTBEAT_SEC`, and `GLOSSAPI_DEEPSEEK_TELEMETRY_INTERVAL_SEC`.
- The default `GLOSSAPI_DEEPSEEK_WORK_ITEM_MAX_ATTEMPTS=2` means one retry after the first failed claim, then the batch is marked failed instead of retrying forever.
- `workers_per_gpu=1` remains the safe default on A100 40GB nodes. Prefer increasing `target_batch_pages` before adding more workers per device.
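
The bounded in-memory handoff between rendering and inference described above can be sketched as a small producer/consumer pipeline. `render_page` and the OCR step are placeholders; the real runtime renders PDF pages and calls the vLLM engine:

```python
import threading
from queue import Queue

# Sketch of the bounded render->inference handoff. The queue's maxsize
# caps how far rendering can run ahead of inference, which overlaps the
# two stages without unbounded memory growth.
def pipeline(num_pages: int, max_buffered: int = 8) -> list[str]:
    q: Queue = Queue(maxsize=max_buffered)
    results: list[str] = []

    def render_page(i: int) -> bytes:
        return f"pixels-{i}".encode()  # placeholder for an in-memory page image

    def producer() -> None:
        for i in range(num_pages):
            q.put((i, render_page(i)))  # blocks when the buffer is full
        q.put(None)  # sentinel: rendering finished

    t = threading.Thread(target=producer)
    t.start()
    while (item := q.get()) is not None:
        page_no, image = item
        results.append(f"ocr:{page_no}:{len(image)}")  # placeholder for inference
    t.join()
    return results
```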

## Provider & Device Checks

- ONNXRuntime providers must include `CUDAExecutionProvider`.

docs/stages/ocr.md

OCR reruns should preserve:

- explicit indication that remediation was attempted
- visibility into files that remain problematic

## DeepSeek runtime contract

- `ocr()` may execute page-range shards internally when `use_gpus="multi"` and `scheduler="exact_fill"`, but the stage contract remains one canonical Markdown file and one canonical metrics file per source PDF.
- When shard execution is used, the runner reassembles `markdown/<stem>.md` and `json/metrics/<stem>.metrics.json` after the CLI workers finish.
- Execution-time shard artifacts are moved under `sidecars/ocr_shards/` so downstream stages do not mistake them for canonical stage outputs.
- The vLLM runtime now streams rendered pages through an in-memory queue, overlaps rendering with inference, skips empty pages before inference, and reuses the same in-memory image for repair retries.
- Canonical OCR markdown now annotates page boundaries with `<!-- page:N -->` comments alongside each page-split marker so downstream inspection can line up page images and markdown more easily.
- In `repair_mode="auto"`, a page that trips the garbage cutoff again during the plain-OCR repair pass is now blanked instead of keeping the original garbage text.
- Multi-GPU vLLM runs now execute through a durable shared batch queue rather than one fragile subprocess per preassigned lane. Workers claim first-pass batches dynamically, heartbeat while a batch is active, and can be respawned without losing finished batch outputs.
- Repair retries are now durable too. Flagged pages are published back into the same runtime database as a second global repair queue, and any GPU worker can drain those repair shards after the first-pass queue is complete.
- By default each durable batch gets at most two total attempts, so one retry is allowed after the first failure and then the batch is marked failed for operator follow-up.
- Operational sidecars for these runs live under `sidecars/ocr_runtime/`, including the durable work queue state, per-worker runtime JSON, GPU telemetry samples, GPU preflight output, and a final runtime summary with steady-state inference timestamps.

## Contributor note

Any change to candidate selection, skiplist semantics, or OCR-success metadata affects both rerun behavior and corpus analysis quality.