feat(HNT-2108): add article extraction handler with Corpus API client #8

Merged: mmiermans merged 8 commits into main from HNT-2108-article-handler on Apr 22, 2026

@mmiermans mmiermans commented Apr 16, 2026

Goal

HNT-2108

Implement the article extraction handler for the crawl worker. When a crawl-article Pub/Sub message arrives, the handler calls Zyte's article extraction API, maps the response to the BigQuery articles event schema, and (for live articles with a corpus_item) detects title/excerpt changes and syncs them to the Curated Corpus Admin API via GraphQL.

New packages added to crawl-common:

  • corpus-api/: GraphQL client for updateApprovedCorpusItem with RS256 JWT auth (jose), in-process token caching (5-min TTL, refreshed once 95% of the TTL, i.e. 285s, has elapsed), and p-retry with exponential backoff.
  • types/: TypeScript interfaces for CrawlArticleMessage (Pub/Sub input) and ArticleEvent (BigQuery output).
  • utils/: normalizeText (ported from Python diff.py).
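
The two timing policies of the corpus-api client can be sketched as pure functions. This is a minimal illustration, not the client's actual code: the 5-minute TTL, the 285s refresh point, and the 4-retry, 2s-16s backoff come from this description, and TOKEN_REFRESH_WINDOW_MS is a name exported by the client, but `TOKEN_TTL_MS`, `shouldRefreshToken`, and `backoffDelaysMs` are hypothetical names. The backoff helper assumes doubling delays without jitter, which is roughly what p-retry produces with `retries: 4`, `minTimeout: 2000`, `factor: 2`.

```typescript
// Sketch only: values come from the PR text; helper names are hypothetical.
const TOKEN_TTL_MS = 5 * 60 * 1000; // 5-minute token lifetime
const TOKEN_REFRESH_WINDOW_MS = (TOKEN_TTL_MS * 95) / 100; // 285000 ms

/** True once a cached token has used up 95% of its TTL and should be re-signed. */
function shouldRefreshToken(issuedAtMs: number, nowMs: number): boolean {
  return nowMs - issuedAtMs >= TOKEN_REFRESH_WINDOW_MS;
}

/** Wait before each retry: starts at minMs and multiplies by factor, no jitter. */
function backoffDelaysMs(retries: number, minMs: number, factor = 2): number[] {
  return Array.from({ length: retries }, (_, i) => minMs * factor ** i);
}

const delays = backoffDelaysMs(4, 2000); // [2000, 4000, 8000, 16000]
const worstCaseWaitMs = delays.reduce((sum, d) => sum + d, 0); // 30000
```

The summed waits (2 + 4 + 8 + 16 = 30s) are where the "~30s worst case" figure comes from; time spent inside each request attempt is on top of that.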

Implementation decisions

| Decision | Approach | Why |
| --- | --- | --- |
| Text normalization | Port of Python diff.py:normalize_text() with identical step ordering (NFC, trim, truncate, strip periods, collapse whitespace, lowercase, normalize quotes) | Exact parity with the existing system avoids false-positive change detections during shadow mode |
| Excerpt comparison length | Truncate to 255 chars before comparing | Matches the Corpus API's backend storage limit and the Python implementation |
| Changed-field-only updates | buildUpdateInput only overrides title or excerpt when the normalized comparison actually detected a change | Prevents cosmetic differences (smart quotes, trailing periods) from overwriting curator-edited values |
| Retry budget | 4 retries with 2s-16s exponential backoff (~30s worst case) | Fits well within the 600s Pub/Sub ack deadline while giving transient errors time to recover |
| Module-level state | initCorpusApiClient() + module-level let variables | Matches the existing Zyte client pattern in this repo |
| Runtime validation | Out of scope: message and event types are plain TS interfaces, no Zod or equivalent | Ticket does not require validation; tracked separately in HNT-2487. Library evaluation preserved in the collapsed section below as a starting point |
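
To make the normalization and changed-field-only decisions concrete, here is a minimal sketch. It follows the step order listed above, but the rules inside each step (stripping only trailing periods, which quote characters are mapped) are assumptions rather than a copy of diff.py. The `TextFields` type, `detectChanges`, the `buildUpdateInput` signature, and the `null` return for "nothing changed" are hypothetical; only the names normalizeText and buildUpdateInput come from this PR.

```typescript
// Sketch only: step order follows the PR description; per-step rules are assumptions.
const EXCERPT_COMPARE_LENGTH = 255;

function normalizeText(text: string, maxLength?: number): string {
  let s = text.normalize("NFC").trim();
  // slice() counts UTF-16 code units, whereas Python slices by code point.
  if (maxLength !== undefined) s = s.slice(0, maxLength);
  return s
    .replace(/\.+$/, "") // strip trailing periods
    .replace(/\s+/g, " ") // collapse whitespace runs to one space
    .toLowerCase()
    .replace(/[\u2018\u2019]/g, "'") // curly single quotes to straight
    .replace(/[\u201C\u201D]/g, '"'); // curly double quotes to straight
}

interface TextFields {
  title: string;
  excerpt: string;
}

/** Compare normalized values so cosmetic differences do not count as changes. */
function detectChanges(current: TextFields, extracted: TextFields) {
  return {
    title: normalizeText(current.title) !== normalizeText(extracted.title),
    excerpt:
      normalizeText(current.excerpt, EXCERPT_COMPARE_LENGTH) !==
      normalizeText(extracted.excerpt, EXCERPT_COMPARE_LENGTH),
  };
}

/** Override only the fields that really changed; keep curator values otherwise. */
function buildUpdateInput(corpusItem: TextFields, extracted: TextFields): TextFields | null {
  const changed = detectChanges(corpusItem, extracted);
  if (!changed.title && !changed.excerpt) return null; // nothing to sync
  return {
    title: changed.title ? extracted.title : corpusItem.title,
    excerpt: changed.excerpt ? extracted.excerpt : corpusItem.excerpt,
  };
}
```

With this shape, a corpus item titled "Hello World." and an extracted title of "hello   world" normalize to the same string, so no update is produced, while a genuinely new title is sent alongside the curator's existing excerpt.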

GraphQL client: raw fetch

The corpus-api client calls a single GraphQL mutation (updateApprovedCorpusItem) using fetch() with an inline mutation string and manual TypeScript types. We considered three approaches:

| Approach | Pros | Cons |
| --- | --- | --- |
| Raw fetch (chosen) | Zero dependencies; matches content-monorepo lambda pattern (lambda-common/graphQlApiCalls.ts); simple for a single mutation; full control over headers, retries, and error handling | No schema validation at build time; mutation string could drift from the API schema |
| @apollo/client + graphql-codegen | Type-safe queries generated from the live schema; used by curation-admin-tools (React frontend) | Heavy dependency (~200 kB); designed for frontend caching/hooks; overkill for a backend service calling one mutation |
| graphql-request | Lightweight GraphQL client with TypeScript support | Extra dependency for minimal benefit over raw fetch; no codegen without additional tooling |

We chose raw fetch because the crawler only calls one mutation, and this matches the established backend convention in content-monorepo's lambdas (corpus-scheduler-lambda, section-manager-lambda). Those lambdas use the same pattern: fetch() POST with { query, variables } body, manual Authorization and apollographql-client-name headers, and hand-written TypeScript types. curation-admin-tools uses Apollo Client + codegen, but that pattern serves a React frontend with many queries/mutations and caching needs.
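
For illustration, the raw-fetch pattern described above might look roughly like the sketch below. It is not the client's actual code: the mutation's argument name and selection set, the Bearer scheme, the client name value, and every helper name (`buildRequest`, `unwrapData`, `postGraphQl`, `FetchLike`) are assumptions. Only the mutation name, the UpdateApprovedCorpusItemInput type name, the `{ query, variables }` body, and the two header keys come from this PR.

```typescript
// Sketch only: argument name, selection set, and helper names are hypothetical.
const UPDATE_APPROVED_CORPUS_ITEM = `
  mutation UpdateApprovedCorpusItem($data: UpdateApprovedCorpusItemInput!) {
    updateApprovedCorpusItem(data: $data) {
      externalId
    }
  }
`;

interface GraphQlRequest {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

interface GraphQlResponse<T> {
  data?: T | null;
  errors?: { message: string }[];
}

// Minimal structural type so a test double can stand in for the global fetch.
type FetchLike = (
  url: string,
  init: GraphQlRequest,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

/** Build the POST request: JSON body of { query, variables } plus manual headers. */
function buildRequest(query: string, variables: Record<string, unknown>, jwt: string): GraphQlRequest {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${jwt}`, // Bearer scheme is an assumption
      "apollographql-client-name": "crawl-worker", // hypothetical client name
    },
    body: JSON.stringify({ query, variables }),
  };
}

/** Surface GraphQL-level errors and guard against a null or missing data field. */
function unwrapData<T>(response: GraphQlResponse<T>): T {
  if (response.errors?.length) {
    throw new Error(`GraphQL error: ${response.errors.map((e) => e.message).join("; ")}`);
  }
  if (response.data == null) {
    throw new Error("GraphQL response contained no data");
  }
  return response.data;
}

async function postGraphQl<T>(fetchFn: FetchLike, url: string, request: GraphQlRequest): Promise<T> {
  const res = await fetchFn(url, request);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return unwrapData((await res.json()) as GraphQlResponse<T>);
}
```

Taking the fetch function as a parameter keeps the sketch testable without network access; in a service the global fetch would be passed in, and the retry and timeout handling mentioned above would wrap `postGraphQl`.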

JWT library: jose

The Corpus API client authenticates with RS256 JWTs signed from a JWK private key stored in Google Secret Manager. content-monorepo's lambda-common uses jsonwebtoken + jwk-to-pem for the same purpose (signing JWTs to call the Corpus Admin API from Lambda functions). curation-admin-tools does not sign JWTs; it receives OAuth2 ID tokens from Mozilla's auth provider and passes them as Bearer tokens.

Legend: 🟢 meets requirement · 🟡 meets with caveat · 🔴 does not meet.

| Library | Pure ESM | Native JWK | Single package | Notes |
| --- | --- | --- | --- | --- |
| jose (chosen) | 🟢 | 🟢 | 🟢 | Built on Web Crypto; no native addons; importJWK takes the JWK directly. |
| jsonwebtoken + jwk-to-pem | 🟡 | 🔴 | 🔴 | CommonJS-first; requires require() interop in ESM. JWK must first be converted to PEM via jwk-to-pem. Two packages for one signing flow. |

We chose jose because this is a pure ESM monorepo targeting Node 24, and jose's native importJWK eliminates the JWK-to-PEM conversion step entirely. content-monorepo uses jsonwebtoken because it predates the ESM transition and runs in Lambda where CommonJS is the default; if those lambdas were being written today, jose would be the better fit there too.

Validation library evaluation (deferred)

This PR does not wire up runtime validation; the Pub/Sub input and BigQuery event types are plain TypeScript interfaces. The evaluation below was done during implementation and is retained for whichever follow-up PR wires validation to the Pub/Sub receive path and/or the articles publish path.

All candidates exceed the pipeline's ~10 validations/sec peak by 5+ orders of magnitude, so throughput was not a differentiator.

Legend: 🟢 meets requirement · 🟡 meets with caveat · 🔴 does not meet.

| Library | Strong TS inference | Pure ESM | Structured errors | Notes |
| --- | --- | --- | --- | --- |
| Zod v4 (leaning) | 🟢 | 🟢 | 🟢 | Strongest ecosystem (42k stars); best error reporting for poison-message logs. |
| Typia | 🟢 | 🔴 | 🟡 | Requires the ts-patch compiler plugin; forward-compat risk with TypeScript's Go port (tsgo). Errors are readable but less structured than Zod's issue paths. |
| Valibot | 🟢 | 🟢 | 🟢 | Much smaller ecosystem and adoption than Zod. |
| ArkType | 🟢 | 🟢 | 🟢 | Significantly smaller adoption for a shared monorepo dependency. |
| AJV + TypeBox | 🟡 | 🟢 | 🔴 | TypeBox needed as a second package to get TS inference from JSON Schema. Fail-fast mode returns a boolean + separate errors array, which is less ergonomic for structured logs. |

The benchmark validated real articles and article_discoveries event shapes from this pipeline; the x-axis is validations per second on a log scale, and the dashed line marks our ~10/sec peak requirement.

[Figure: benchmark chart of validations per second by library (screenshot from 2026-04-21 14-31-53)]

Zod looks like the best fit because it provides the cleanest schema-first decode pipeline for Pub/Sub boundaries (JSON string in, typed object out), has the strongest error reporting for diagnosing poison messages in production logs, and introduces zero toolchain complexity in a pure ESM monorepo. Typia's ts-patch requirement is a forward-compatibility risk as TypeScript moves toward its Go-based compiler, and its transform/coercion story is less natural for boundary decoding than Zod's schema-driven approach.

mmiermans and others added 5 commits April 15, 2026 19:36

Implement the article extraction handler that calls Zyte, maps the
response to the articles event schema, detects title/excerpt changes
for live articles, and syncs them to the Curated Corpus Admin API.

New modules in crawl-common: text normalization (ported from Python
diff.py), SHA-256 url hashing, Pub/Sub message types with Zod schemas,
and a Corpus Admin API GraphQL client with RS256 JWT auth via jose.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Add fetch timeout, null-data guard, and empty-JWK guard to the Corpus
API client. Fix buildUpdateInput to only send changed fields, preventing
cosmetic overwrites of curator edits. Move retry test to integration
file with fake timers (2s -> instant) and add token refresh coverage.
Consolidate duplicate tests with it.each.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Remove urlHash (unused in this PR; belongs in Task 7.1 when Redis
state client is implemented). Extract shared corpus-api test fixtures
into test-helpers.ts to eliminate duplication between unit and
integration tests. Clarify normalizeText doc block.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

Runtime validation is not required by the ticket and the schemas
were unused at runtime (only z.infer for type derivation). Defer
the validation library choice to whichever PR wires up Pub/Sub
message parsing for real.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

published_at?: string;
breadcrumbs?: ArticleBreadcrumb[];
language?: string;
}
Review comment (PR author):

This may be a good place to validate the schema, but I created a separate ticket to keep the scope of this PR manageable.

extractArticle: vi.fn(),
updateApprovedCorpusItem: vi.fn(),
};
});
Review comment (PR author):

This apparently needs to happen before the following imports. I see a similar pattern in some content-monorepo modules, including lambdas/section-manager-lambda/src/graphQlApiCalls.spec.ts, so I guess this is conventional?

mmiermans and others added 2 commits April 22, 2026 09:12
- Handler integration test now stubs fetch via vi.stubGlobal and
  exercises the real Zyte and Corpus API clients end-to-end
  (JWT signing, GraphQL body serialization, response parsing),
  instead of module-mocking like the unit tests.
- Extract shared fixtures (ZYTE_ARTICLE, BASE_MESSAGE, CORPUS_ITEM,
  TEST_JWK) into services/crawl-worker/src/handlers/test-helpers.ts.
- Rename SAMPLE_INPUT / SUCCESS_BODY to UPDATE_APPROVED_CORPUS_ITEM_*
  so the next mutation's fixtures don't collide.
- Rename the one integration test that broke the 'it <verb>'
  convention.
- Drop the redundant camelCase comment from
  UpdateApprovedCorpusItemInput's doc block.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

},
},
);
}
Review comment (PR author):

Eventually we probably will want to break this file up, or introduce abstractions, but I figured we could start simple.

@mmiermans mmiermans marked this pull request as ready for review April 22, 2026 17:59
- Replace literal assertion values with references to their source
  fixture (ZYTE_ARTICLE, CORPUS_ITEM, CLIENT_OPTS,
  UPDATE_APPROVED_CORPUS_ITEM_INPUT/_SUCCESS_BODY) so fixture
  changes do not silently break test intent.
- Type the integration test's fetch mock via vi.fn<typeof fetch>()
  to match the sibling pattern in corpus-api/client.integration.ts
  and drop the as RequestInit cast.
- Export TOKEN_REFRESH_WINDOW_MS so the token-refresh tests derive
  the window from client config instead of re-hardcoding 285s.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@mmiermans mmiermans merged commit 8917542 into main Apr 22, 2026
3 checks passed