feat(HNT-2108): add article extraction handler with Corpus API client by mmiermans · Pull Request #8 · mozilla/hnt-content

mmiermans · 2026-04-16T03:36:18Z

Goal

Implement the article extraction handler for the crawl worker. When a crawl-article Pub/Sub message arrives, the handler calls Zyte's article extraction API, maps the response to the BigQuery articles event schema, and (for live articles with a corpus_item) detects title/excerpt changes and syncs them to the Curated Corpus Admin API via GraphQL.

New packages added to crawl-common:

corpus-api/: GraphQL client for updateApprovedCorpusItem with RS256 JWT auth (jose), in-process token caching (5-min TTL, with a refresh once 95% of the TTL, i.e. 285s, has elapsed), and p-retry with exponential backoff.
types/: TypeScript interfaces for CrawlArticleMessage (Pub/Sub input) and ArticleEvent (BigQuery output).
utils/: normalizeText (ported from Python diff.py).
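
The token caching in corpus-api/ could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `createTokenCache`, `needsRefresh`, `signToken`, and the injected `now` clock are hypothetical names, and the way the refresh window is derived from the TTL is a guess. Only the 5-minute TTL, the 95% refresh point, and the constant name `TOKEN_REFRESH_WINDOW_MS` come from this PR.

```typescript
// 5-minute TTL and 95% refresh point are taken from the PR description;
// deriving one from the other like this is an assumption.
const TOKEN_TTL_MS = 5 * 60 * 1000;
const TOKEN_REFRESH_WINDOW_MS = (TOKEN_TTL_MS * 95) / 100; // 285_000 ms

// Hypothetical helper: true once the cached token is old enough to re-sign.
function needsRefresh(issuedAtMs: number, nowMs: number): boolean {
  return nowMs - issuedAtMs >= TOKEN_REFRESH_WINDOW_MS;
}

// Hypothetical in-process cache. `signToken` stands in for the JWT signing
// step; `now` is injectable so tests can control the clock.
function createTokenCache(
  signToken: () => Promise<string>,
  now: () => number = Date.now,
): () => Promise<string> {
  let cached: { token: string; issuedAtMs: number } | undefined;

  return async function getToken(): Promise<string> {
    const nowMs = now();
    if (cached !== undefined && !needsRefresh(cached.issuedAtMs, nowMs)) {
      return cached.token; // still inside the refresh window: reuse
    }
    const token = await signToken(); // sign a fresh token before the old one expires
    cached = { token, issuedAtMs: nowMs };
    return token;
  };
}
```

Refreshing at 285s leaves a 15s margin before a 300s token expires, so a request that starts just before the refresh point should still carry a valid token.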

Implementation decisions

| Decision | Approach | Why |
| --- | --- | --- |
| Text normalization | Port of Python `diff.py:normalize_text()` with identical step ordering (NFC, trim, truncate, strip periods, collapse whitespace, lowercase, normalize quotes) | Exact parity with the existing system avoids false-positive change detections during shadow mode |
| Excerpt comparison length | Truncate to 255 chars before comparing | Matches the Corpus API's backend storage limit and the Python implementation |
| Changed-field-only updates | `buildUpdateInput` only overrides title or excerpt when the normalized comparison actually detected a change | Prevents cosmetic differences (smart quotes, trailing periods) from overwriting curator-edited values |
| Retry budget | 4 retries with 2s-16s exponential backoff (~30s worst case) | Fits well within the 600s Pub/Sub ack deadline while giving transient errors time to recover |
| Module-level state | `initCorpusApiClient()` + module-level `let` variables | Matches the existing Zyte client pattern in this repo |
| Runtime validation | Out of scope: message and event types are plain TS interfaces, no Zod or equivalent | Ticket does not require validation; tracked separately in HNT-2487. Library evaluation preserved in the collapsed section below as a starting point |
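
To make the normalization order and the changed-field-only rule concrete, here is a minimal sketch. It follows the step order listed in the table, but the details of each step are assumptions (for example, that "strip periods" means trailing periods, and which quote characters are normalized), and `changedFields` is a hypothetical stand-in for the comparison that feeds `buildUpdateInput`, not the PR's function.

```typescript
const EXCERPT_COMPARE_LENGTH = 255; // from the decision table

// Sketch of the step order: NFC, trim, truncate, strip periods,
// collapse whitespace, lowercase, normalize quotes.
// Note: slice() counts UTF-16 code units, which can differ from Python's
// code-point slicing for characters outside the Basic Multilingual Plane.
function normalizeText(text: string, maxLength?: number): string {
  let s = text.normalize("NFC");
  s = s.trim();
  if (maxLength !== undefined) s = s.slice(0, maxLength);
  s = s.replace(/\.+$/, ""); // assumption: only trailing periods are stripped
  s = s.replace(/\s+/g, " ");
  s = s.toLowerCase();
  // assumption: curly single and double quotes map to their ASCII forms
  s = s.replace(/[\u2018\u2019]/g, "'").replace(/[\u201C\u201D]/g, '"');
  return s;
}

interface TitleExcerpt {
  title: string;
  excerpt: string;
}

// Hypothetical helper: returns only the fields whose normalized values differ,
// so cosmetic differences never overwrite the stored (curator-edited) value.
function changedFields(stored: TitleExcerpt, extracted: TitleExcerpt): Partial<TitleExcerpt> {
  const changes: Partial<TitleExcerpt> = {};
  if (normalizeText(stored.title) !== normalizeText(extracted.title)) {
    changes.title = extracted.title;
  }
  const storedExcerpt = normalizeText(stored.excerpt, EXCERPT_COMPARE_LENGTH);
  const extractedExcerpt = normalizeText(extracted.excerpt, EXCERPT_COMPARE_LENGTH);
  if (storedExcerpt !== extractedExcerpt) {
    changes.excerpt = extracted.excerpt;
  }
  return changes;
}
```

With this shape, a stored title of "Hello World." and an extracted title of "hello   world" compare as equal, so no title override is produced.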

GraphQL client: raw fetch

The corpus-api client calls a single GraphQL mutation (updateApprovedCorpusItem) using fetch() with an inline mutation string and manual TypeScript types. We considered three approaches:

| Approach | Pros | Cons |
| --- | --- | --- |
| Raw fetch (chosen) | Zero dependencies; matches content-monorepo lambda pattern (`lambda-common/graphQlApiCalls.ts`); simple for a single mutation; full control over headers, retries, and error handling | No schema validation at build time; mutation string could drift from the API schema |
| @apollo/client + graphql-codegen | Type-safe queries generated from the live schema; used by curation-admin-tools (React frontend) | Heavy dependency (~200 kB); designed for frontend caching/hooks; overkill for a backend service calling one mutation |
| graphql-request | Lightweight GraphQL client with TypeScript support | Extra dependency for minimal benefit over raw fetch; no codegen without additional tooling |

We chose raw fetch because the crawler only calls one mutation, and this matches the established backend convention in content-monorepo's lambdas (corpus-scheduler-lambda, section-manager-lambda). Those lambdas use the same pattern: fetch() POST with { query, variables } body, manual Authorization and apollographql-client-name headers, and hand-written TypeScript types. curation-admin-tools uses Apollo Client + codegen, but that pattern serves a React frontend with many queries/mutations and caching needs.
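
The raw-fetch pattern described above, combined with the retry budget from the decisions table, might be sketched as follows. This is an illustration under assumptions: the mutation's argument name and selection set, the helper names (`buildRequest`, `backoffDelaysMs`, `postWithRetry`), and the hand-rolled retry loop are all hypothetical. The PR itself uses p-retry, which this sketch replaces with a plain loop to stay dependency-free, and the fetch function is injected through a minimal structural type instead of relying on the global.

```typescript
// Minimal structural types so the sketch needs no DOM or Node typings.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}
type FetchLike = (url: string, init: HttpRequest) => Promise<HttpResponse>;

// Illustrative mutation text; the `data` argument and `externalId` field are assumptions.
const UPDATE_MUTATION = `
  mutation UpdateApprovedCorpusItem($data: UpdateApprovedCorpusItemInput!) {
    updateApprovedCorpusItem(data: $data) { externalId }
  }`;

// POST body of { query, variables } plus the two manually set headers.
function buildRequest(token: string, clientName: string, variables: Record<string, unknown>): HttpRequest {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
      "apollographql-client-name": clientName,
    },
    body: JSON.stringify({ query: UPDATE_MUTATION, variables }),
  };
}

// Delays before each retry: 2s, 4s, 8s, 16s for 4 retries (30s in total).
function backoffDelaysMs(retries = 4, baseMs = 2000): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

// One initial attempt plus one attempt per delay. This sketch retries every
// failure; a real client would likely stop early on non-transient errors.
async function postWithRetry(
  fetchImpl: FetchLike,
  url: string,
  init: HttpRequest,
  sleep: (ms: number) => Promise<void>,
): Promise<unknown> {
  let lastError: unknown;
  for (const delayMs of [0, ...backoffDelaysMs()]) {
    if (delayMs > 0) await sleep(delayMs);
    try {
      const res = await fetchImpl(url, init);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      const payload = (await res.json()) as { data?: unknown; errors?: unknown };
      if (payload.errors !== undefined || payload.data == null) {
        throw new Error("GraphQL errors or null data in response");
      }
      return payload.data;
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

In a Node service the global `fetch` can be passed as `fetchImpl`, and injecting `sleep` lets tests skip the real 2s-16s waits.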

JWT library: jose

The Corpus API client authenticates with RS256 JWTs signed from a JWK private key stored in Google Secret Manager. content-monorepo's lambda-common uses jsonwebtoken + jwk-to-pem for the same purpose (signing JWTs to call the Corpus Admin API from Lambda functions). curation-admin-tools does not sign JWTs; it receives OAuth2 ID tokens from Mozilla's auth provider and passes them as Bearer tokens.

Legend: 🟢 meets requirement · 🟡 meets with caveat · 🔴 does not meet.

| Library | Pure ESM | Native JWK | Single package | Notes |
| --- | --- | --- | --- | --- |
| jose (chosen) | 🟢 | 🟢 | 🟢 | Built on Web Crypto; no native addons; `importJWK` takes the JWK directly. |
| jsonwebtoken + jwk-to-pem | 🟡 | 🔴 | 🔴 | CommonJS-first; requires `require()` interop in ESM. JWK must first be converted to PEM via `jwk-to-pem`. Two packages for one signing flow. |

We chose jose because this is a pure ESM monorepo targeting Node 24, and jose's native importJWK eliminates the JWK-to-PEM conversion step entirely. content-monorepo uses jsonwebtoken because it predates the ESM transition and runs in Lambda where CommonJS is the default; if those lambdas were being written today, jose would be the better fit there too.

Validation library evaluation (deferred)

This PR does not wire up runtime validation; the Pub/Sub input and BigQuery event types are plain TypeScript interfaces. The evaluation below was done during implementation and is retained for whichever follow-up PR wires validation to the Pub/Sub receive path and/or the articles publish path.

All candidates exceed the pipeline's ~10 validations/sec peak by 5+ orders of magnitude, so throughput was not a differentiator.

Legend: 🟢 meets requirement · 🟡 meets with caveat · 🔴 does not meet.

| Library | Strong TS inference | Pure ESM | Structured errors | Notes |
| --- | --- | --- | --- | --- |
| Zod v4 (leaning) | 🟢 | 🟢 | 🟢 | Strongest ecosystem (42k stars); best error reporting for poison-message logs. |
| Typia | 🟢 | 🔴 | 🟡 | Requires the `ts-patch` compiler plugin; forward-compat risk with TypeScript's Go port (tsgo). Errors are readable but less structured than Zod's issue paths. |
| Valibot | 🟢 | 🟢 | 🟢 | Much smaller ecosystem and adoption than Zod. |
| ArkType | 🟢 | 🟢 | 🟢 | Significantly smaller adoption for a shared monorepo dependency. |
| AJV + TypeBox | 🟡 | 🟢 | 🔴 | TypeBox needed as a second package to get TS inference from JSON Schema. Fail-fast mode returns a boolean + separate errors array, which is less ergonomic for structured logs. |

The benchmark validated real articles and article_discoveries event shapes from this pipeline; the x-axis is validations per second on a log scale, and the dashed line marks our ~10/sec peak requirement.

Zod looks like the best fit because it provides the cleanest schema-first decode pipeline for Pub/Sub boundaries (JSON string in, typed object out), has the strongest error reporting for diagnosing poison messages in production logs, and introduces zero toolchain complexity in a pure ESM monorepo. Typia's ts-patch requirement is a forward-compatibility risk as TypeScript moves toward its Go-based compiler, and its transform/coercion story is less natural for boundary decoding than Zod's schema-driven approach.

Implement the article extraction handler that calls Zyte, maps the response to the articles event schema, detects title/excerpt changes for live articles, and syncs them to the Curated Corpus Admin API. New modules in crawl-common: text normalization (ported from Python diff.py), SHA-256 url hashing, Pub/Sub message types with Zod schemas, and a Corpus Admin API GraphQL client with RS256 JWT auth via jose.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Add fetch timeout, null-data guard, and empty-JWK guard to the Corpus API client. Fix buildUpdateInput to only send changed fields, preventing cosmetic overwrites of curator edits. Move retry test to integration file with fake timers (2s -> instant) and add token refresh coverage. Consolidate duplicate tests with it.each.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Remove urlHash (unused in this PR; belongs in Task 7.1 when Redis state client is implemented). Extract shared corpus-api test fixtures into test-helpers.ts to eliminate duplication between unit and integration tests. Clarify normalizeText doc block.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Runtime validation is not required by the ticket and the schemas were unused at runtime (only z.infer for type derivation). Defer the validation library choice to whichever PR wires up Pub/Sub message parsing for real.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

mmiermans · 2026-04-21T23:06:54Z

+  published_at?: string;
+  breadcrumbs?: ArticleBreadcrumb[];
+  language?: string;
+}


This may be a good place to validate the schema, but I created a separate ticket to keep the scope of this PR manageable.

mmiermans · 2026-04-21T23:33:50Z

+    extractArticle: vi.fn(),
+    updateApprovedCorpusItem: vi.fn(),
+  };
+});


This apparently needs to happen before the following imports. I see a similar pattern in some content-monorepo modules, including lambdas/section-manager-lambda/src/graphQlApiCalls.spec.ts, so I guess this is conventional?

- Handler integration test now stubs fetch via vi.stubGlobal and exercises the real Zyte and Corpus API clients end-to-end (JWT signing, GraphQL body serialization, response parsing), instead of module-mocking like the unit tests.
- Extract shared fixtures (ZYTE_ARTICLE, BASE_MESSAGE, CORPUS_ITEM, TEST_JWK) into services/crawl-worker/src/handlers/test-helpers.ts.
- Rename SAMPLE_INPUT / SUCCESS_BODY to UPDATE_APPROVED_CORPUS_ITEM_* so the next mutation's fixtures don't collide.
- Rename the one integration test that broke the 'it <verb>' convention.
- Drop the redundant camelCase comment from UpdateApprovedCorpusItemInput's doc block.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

mmiermans · 2026-04-22T17:29:39Z

+      },
+    },
+  );
+}


Eventually we probably will want to break this file up, or introduce abstractions, but I figured we could start simple.

- Replace literal assertion values with references to their source fixture (ZYTE_ARTICLE, CORPUS_ITEM, CLIENT_OPTS, UPDATE_APPROVED_CORPUS_ITEM_INPUT/_SUCCESS_BODY) so fixture changes do not silently break test intent.
- Type the integration test's fetch mock via vi.fn<typeof fetch>() to match the sibling pattern in corpus-api/client.integration.ts and drop the as RequestInit cast.
- Export TOKEN_REFRESH_WINDOW_MS so the token-refresh tests derive the window from client config instead of re-hardcoding 285s.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

mmiermans and others added 5 commits April 15, 2026 19:36

fix(HNT-2108): remove unused ZyteResponse import

d622ec3

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>


mmiermans and others added 2 commits April 22, 2026 09:12

chore(HNT-2108): apply Prettier formatting

140e98c

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>


mmiermans marked this pull request as ready for review April 22, 2026 17:59

mmiermans merged commit 8917542 into main Apr 22, 2026
3 checks passed

mmiermans merged 8 commits into main from HNT-2108-article-handler
