
Kheiss/up ovr #1861

Closed
kheiss-uwzoo wants to merge 106 commits into main from kheiss/up-ovr


@kheiss-uwzoo
Collaborator

NVIDIA NeMo Retriever Library is a scalable, performance-oriented framework for document content and metadata extraction. It supports both NVIDIA NIM microservices and a wide range of models to find, contextualize, and extract text, tables, charts, and infographics for use in downstream generative and retrieval-augmented applications.

kheiss-uwzoo and others added 30 commits February 19, 2026 10:36
Update all hardcoded version references from 26.1.2 to 26.3.0-RC1
across helm charts, docker-compose, FastAPI, docs, and examples.
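
A bulk version bump like this one can be sketched as a guarded string replacement. The snippet below is an illustration only, not the tooling used for this commit: the helper name `bump_version` and the sample strings are hypothetical, and the lookarounds are one possible way to avoid touching longer versions such as 26.1.20.

```python
import re


def bump_version(text, old="26.1.2", new="26.3.0-RC1"):
    """Replace exact occurrences of `old` in `text` (hypothetical helper).

    Returns a (new_text, replacement_count) tuple, as re.subn does.
    """
    # The lookbehind rejects a preceding digit or dot (e.g. 126.1.2 or 1.26.1.2);
    # the lookaheads reject a trailing digit (26.1.20) or a further ".<digit>" part.
    pattern = re.compile(r"(?<![\d.])" + re.escape(old) + r"(?!\d)(?!\.\d)")
    return pattern.subn(new, text)


# Hypothetical input line, standing in for a helm or docker-compose entry.
updated, count = bump_version("image: example/app:26.1.2")
```

Returning the replacement count makes it easy to report which files actually changed when the helper is applied across a directory tree.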

Made-with: Cursor
Co-authored-by: Kurt Heiss <kheiss@nvidia.com>
Co-authored-by: Jeremy Dyer <jdye64@gmail.com>
…ing long VLM captioning

Large PDFs with VLM captioning enabled can take 2-22+ hours depending on hardware.
The previous defaults (STATE_TTL=7200s, RESULT_DATA_TTL=3600s) caused job state to
expire mid-processing, resulting in 404 "Job ID not found or state has expired" errors
even though the pipeline completed successfully.

Raises both defaults to 172800s (48 hours), providing sufficient headroom for all
observed workloads. Users can still override via RESULT_DATA_TTL_SECONDS and
STATE_TTL_SECONDS environment variables.
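
The override behaviour described above can be sketched roughly as follows. This is a minimal illustration, assuming the service reads each variable as an integer number of seconds; the helper name `ttl_from_env` and its validation are hypothetical and not taken from the project's code.

```python
import os

# New default from this commit: 48 hours, expressed in seconds.
DEFAULT_TTL_SECONDS = 48 * 60 * 60  # 172800


def ttl_from_env(name, default=DEFAULT_TTL_SECONDS, environ=None):
    """Return a TTL in seconds read from an environment variable (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        # Variable unset or empty: fall back to the 48-hour default.
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be a positive number of seconds, got {value}")
    return value


# With nothing set, both TTLs resolve to the new default.
state_ttl = ttl_from_env("STATE_TTL_SECONDS", environ={})
# An explicit override (here the old 3600s value) still takes precedence.
result_ttl = ttl_from_env("RESULT_DATA_TTL_SECONDS", environ={"RESULT_DATA_TTL_SECONDS": "3600"})
```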

Fixes: Customer bug 5914605

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@kheiss-uwzoo kheiss-uwzoo requested review from a team as code owners April 15, 2026 20:54
@kheiss-uwzoo kheiss-uwzoo requested a review from jioffe502 April 15, 2026 20:54
@greptile-apps
Contributor

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

This PR prepends new introductory content to docs/docs/extraction/overview.md: a rename notice (nv-ingest → NeMo Retriever Library), a deprecation note for Cached/Deplot, and a revised high-level description paragraph. However, the original introductory paragraphs were not removed, leaving the document with two conflicting intro blocks that differ in wording, scope of listed capabilities, and whether embedding/storage steps are described as optional.

Confidence Score: 3/5

The duplicate and contradictory intro blocks will confuse readers; the old paragraphs should be removed or folded into the new intro before this PR is merged.

A P1 documentation correctness issue remains: two competing introductory sections with inconsistent descriptions of the library's capabilities. This directly harms the user-facing docs and should be resolved before merging.

docs/docs/extraction/overview.md — lines 17–20 (original intro) conflict with the newly added lines 3–11.
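
A leftover block like this can be caught before review with a quick similarity check over the file's paragraphs. The sketch below is only an illustration using Python's standard `difflib`; the function name `near_duplicate_paragraphs` and the 0.6 threshold are assumptions, not part of this repository or of the review tooling.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_paragraphs(markdown_text, threshold=0.6):
    """Yield (i, j, ratio) for paragraph pairs whose similarity meets the threshold."""
    # Paragraphs are taken to be blocks separated by blank lines.
    paragraphs = [p.strip() for p in markdown_text.split("\n\n") if p.strip()]
    for (i, a), (j, b) in combinations(enumerate(paragraphs), 2):
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            yield i, j, ratio


# Toy input: the first and third paragraphs are identical, the second is unrelated.
sample = "alpha beta gamma\n\nzzzz qqqq\n\nalpha beta gamma"
flagged = list(near_duplicate_paragraphs(sample))
```

The threshold would need tuning on real documentation, since legitimately parallel paragraphs (for example, per-format feature lists) can also score highly.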

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/docs/extraction/overview.md | New introductory block (rename note, deprecation note, revised description) inserted before the existing intro, leaving two conflicting intro paragraphs with inconsistent wording about library capabilities. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["overview.md (before PR)"] --> B["Intro paragraph\n(high retrieval accuracy...)"]
    B --> C["Parallelization paragraph\n(manages embeddings, stores to LanceDB)"]
    C --> D["What NeMo Retriever Library Is ✔️"]
    
    E["overview.md (after PR)"] --> F["NEW: Intro paragraph\n(scalable, performance-oriented...)"]
    F --> G["NEW: Rename note (nv-ingest → NeMo Retriever Library)"]
    G --> H["NEW: Parallelization paragraph\n(optionally manages embeddings, LanceDB or Milvus)"]
    H --> I["NEW: Deprecation note (Cached/Deplot)"]
    I --> J["OLD: Intro paragraph\n(high retrieval accuracy...) ⚠️ DUPLICATE"]
    J --> K["OLD: Parallelization paragraph\n(manages, stores) ⚠️ CONFLICTS"]
    K --> L["What NeMo Retriever Library Is ✔️"]
    
    style J fill:#ffcccc
    style K fill:#ffcccc

Comments Outside Diff (1)

  1. docs/docs/extraction/overview.md, line 17-20 (link)

    P1 Duplicate and conflicting introductory content

    The newly added paragraphs (lines 3–11) introduce NeMo Retriever Library in a way that is nearly duplicate — but subtly inconsistent — with the original paragraphs that still remain here. Line 3 calls it "scalable, performance-oriented" while line 17 calls it "high retrieval accuracy, performant, and scalable"; lines 9–11 say the library "can optionally manage" embedding and storage, while lines 19–20 say it "manages" and "stores into". Readers will encounter two competing intro sections with contradictory details about the library's capabilities and listed file types. The old paragraph block (lines 17–20) should be removed or merged into the new intro.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/docs/extraction/overview.md
Line: 17-20

Comment:
**Duplicate and conflicting introductory content**

The newly added paragraphs (lines 3–11) introduce NeMo Retriever Library in a way that is nearly duplicate — but subtly inconsistent — with the original paragraphs that still remain here. Line 3 calls it "scalable, performance-oriented" while line 17 calls it "high retrieval accuracy, performant, and scalable"; lines 9–11 say the library "can optionally manage" embedding and storage, while lines 19–20 say it "manages" and "stores into". Readers will encounter two competing intro sections with contradictory details about the library's capabilities and listed file types. The old paragraph block (lines 17–20) should be removed or merged into the new intro.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docs/docs/extraction/overview.md
Line: 9

Comment:
**Missing hyphen in compound modifier**

"well defined" modifying "JSON schema" is a compound adjective and should be hyphenated.

```suggestion
NeMo Retriever Library enables parallelization of splitting documents into pages where artifacts are classified (such as text, tables, charts, and infographics), extracted, and further contextualized through optical character recognition (OCR) into a well-defined JSON schema. 
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (4): Last reviewed commit: "Merge branch 'main' into kheiss/up-ovr"

Comment thread docs/docs/extraction/overview.md Outdated
@jdye64
Collaborator

jdye64 commented Apr 22, 2026

@kheiss-uwzoo approved, but there are merge conflicts that need to be resolved before I can merge

@kheiss-uwzoo kheiss-uwzoo deleted the kheiss/up-ovr branch April 30, 2026 15:39

8 participants