feat(HNT-2108): add article extraction handler with Corpus API client #8
Merged
Implement the article extraction handler that calls Zyte, maps the response to the articles event schema, detects title/excerpt changes for live articles, and syncs them to the Curated Corpus Admin API. New modules in crawl-common: text normalization (ported from Python diff.py), SHA-256 url hashing, Pub/Sub message types with Zod schemas, and a Corpus Admin API GraphQL client with RS256 JWT auth via jose. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add fetch timeout, null-data guard, and empty-JWK guard to the Corpus API client. Fix buildUpdateInput to only send changed fields, preventing cosmetic overwrites of curator edits. Move retry test to integration file with fake timers (2s -> instant) and add token refresh coverage. Consolidate duplicate tests with it.each. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Remove urlHash (unused in this PR; belongs in Task 7.1 when Redis state client is implemented). Extract shared corpus-api test fixtures into test-helpers.ts to eliminate duplication between unit and integration tests. Clarify normalizeText doc block. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Runtime validation is not required by the ticket and the schemas were unused at runtime (only z.infer for type derivation). Defer the validation library choice to whichever PR wires up Pub/Sub message parsing for real. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
mmiermans (Collaborator, Author) commented Apr 21, 2026
  published_at?: string;
  breadcrumbs?: ArticleBreadcrumb[];
  language?: string;
}
This may be a good place to validate the schema, but I created a separate ticket to keep the scope of this PR manageable.
mmiermans (Collaborator, Author) commented Apr 21, 2026
    extractArticle: vi.fn(),
    updateApprovedCorpusItem: vi.fn(),
  };
});
This apparently needs to happen before the following imports. I see a similar pattern in some content-monorepo modules, including lambdas/section-manager-lambda/src/graphQlApiCalls.spec.ts, so I guess this is conventional?
- Handler integration test now stubs fetch via vi.stubGlobal and exercises the real Zyte and Corpus API clients end-to-end (JWT signing, GraphQL body serialization, response parsing), instead of module-mocking like the unit tests.
- Extract shared fixtures (ZYTE_ARTICLE, BASE_MESSAGE, CORPUS_ITEM, TEST_JWK) into services/crawl-worker/src/handlers/test-helpers.ts.
- Rename SAMPLE_INPUT / SUCCESS_BODY to UPDATE_APPROVED_CORPUS_ITEM_* so the next mutation's fixtures don't collide.
- Rename the one integration test that broke the 'it <verb>' convention.
- Drop the redundant camelCase comment from UpdateApprovedCorpusItemInput's doc block.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
mmiermans (Collaborator, Author) commented Apr 22, 2026
        },
      },
    );
  }
Eventually we probably will want to break this file up, or introduce abstractions, but I figured we could start simple.
- Replace literal assertion values with references to their source fixture (ZYTE_ARTICLE, CORPUS_ITEM, CLIENT_OPTS, UPDATE_APPROVED_CORPUS_ITEM_INPUT/_SUCCESS_BODY) so fixture changes do not silently break test intent.
- Type the integration test's fetch mock via vi.fn<typeof fetch>() to match the sibling pattern in corpus-api/client.integration.ts and drop the as RequestInit cast.
- Export TOKEN_REFRESH_WINDOW_MS so the token-refresh tests derive the window from client config instead of re-hardcoding 285s.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Goal
HNT-2108
Implement the article extraction handler for the crawl worker. When a `crawl-article` Pub/Sub message arrives, the handler calls Zyte's article extraction API, maps the response to the BigQuery `articles` event schema, and (for live articles with a `corpus_item`) detects title/excerpt changes and syncs them to the Curated Corpus Admin API via GraphQL.

New packages added to `crawl-common`:
- `updateApprovedCorpusItem` with RS256 JWT auth (jose), in-process token caching with a 5-minute TTL and a 95% refresh buffer (a 285 s refresh window), and p-retry with exponential backoff.
- `CrawlArticleMessage` (Pub/Sub input) and `ArticleEvent` (BigQuery output).
- `normalizeText` (ported from Python `diff.py`).

Implementation decisions
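A minimal sketch of the two comparison-related decisions in this section, under stated assumptions. The step order follows the one listed for `diff.py:normalize_text()`; everything else is illustrative: `MAX_COMPARE_LENGTH` is a made-up value (the PR does not state the truncation length), "strip periods" is interpreted as trimming leading and trailing periods, and the `TitleExcerpt` shape and the `null` return for "nothing changed" are hypothetical.

```typescript
// Hypothetical truncation length; the real value is not stated in the PR.
const MAX_COMPARE_LENGTH = 255;

/** Normalizes text for comparison only, following the step order listed in the PR. */
function normalizeText(input: string, maxLength: number = MAX_COMPARE_LENGTH): string {
  return input
    .normalize("NFC") // 1. Unicode NFC
    .trim() // 2. trim
    .slice(0, maxLength) // 3. truncate
    .replace(/^\.+|\.+$/g, "") // 4. strip periods (assumed: leading/trailing only)
    .replace(/\s+/g, " ") // 5. collapse whitespace
    .toLowerCase() // 6. lowercase
    .replace(/[\u2018\u2019]/g, "'") // 7. normalize curly single quotes
    .replace(/[\u201C\u201D]/g, '"'); //    and curly double quotes
}

// Hypothetical shape holding just the two fields the handler compares.
interface TitleExcerpt {
  title: string;
  excerpt: string;
}

/**
 * Starts from the curated values and overrides a field only when its
 * normalized form differs, so cosmetic differences never overwrite curator edits.
 * Returns null when neither field changed (an assumption of this sketch).
 */
function buildUpdateInput(curated: TitleExcerpt, extracted: TitleExcerpt): TitleExcerpt | null {
  const titleChanged = normalizeText(curated.title) !== normalizeText(extracted.title);
  const excerptChanged = normalizeText(curated.excerpt) !== normalizeText(extracted.excerpt);
  if (!titleChanged && !excerptChanged) return null;
  return {
    title: titleChanged ? extracted.title : curated.title,
    excerpt: excerptChanged ? extracted.excerpt : curated.excerpt,
  };
}
```

With this shape, a publisher changing only casing, spacing, or a trailing period leaves the curated text in place, while a real rewrite of one field updates that field alone.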
- `diff.py:normalize_text()` with identical step ordering (NFC, trim, truncate, strip periods, collapse whitespace, lowercase, normalize quotes)
- `buildUpdateInput` only overrides title or excerpt when the normalized comparison actually detected a change
- `initCorpusApiClient()` + module-level `let` variables

GraphQL client: raw fetch
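For orientation, the raw-fetch pattern this section describes might look roughly like the sketch below. It is not the PR's client: the function names, the `Bearer` scheme on the `Authorization` header, and the 10-second default timeout are assumptions, and the caller supplies the mutation string and variables. It shows the `{ query, variables }` POST body, the two manual headers, a fetch timeout, and a guard against a response with no `data`.

```typescript
// Hand-written response envelope, in the spirit of "manual TypeScript types".
interface GraphQlResponse<T> {
  data?: T | null;
  errors?: { message: string }[];
}

/** Builds the POST request: { query, variables } body plus the two manual headers. */
function buildGraphQlRequest(
  query: string,
  variables: Record<string, unknown>,
  token: string,
  clientName: string,
) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`, // scheme assumed for this sketch
      "apollographql-client-name": clientName,
    },
    body: JSON.stringify({ query, variables }),
  };
}

/** Rejects GraphQL-level errors and responses that carry no data. */
function unwrapGraphQlResponse<T>(payload: GraphQlResponse<T>): T {
  if (payload.errors && payload.errors.length > 0) {
    throw new Error(payload.errors.map((e) => e.message).join("; "));
  }
  if (payload.data === null || payload.data === undefined) {
    throw new Error("GraphQL response contained no data");
  }
  return payload.data;
}

/** Sends the request and aborts it if no response arrives within timeoutMs. */
async function postGraphQl<T>(
  endpoint: string,
  query: string,
  variables: Record<string, unknown>,
  token: string,
  clientName: string,
  timeoutMs: number = 10000, // hypothetical default
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(endpoint, {
      ...buildGraphQlRequest(query, variables, token, clientName),
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`GraphQL endpoint responded with HTTP ${response.status}`);
    }
    return unwrapGraphQlResponse((await response.json()) as GraphQlResponse<T>);
  } finally {
    clearTimeout(timer);
  }
}
```

Keeping request construction and response unwrapping as pure functions lets unit tests cover them without a network, while an integration test can stub `fetch` around `postGraphQl`.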
The corpus-api client calls a single GraphQL mutation (`updateApprovedCorpusItem`) using `fetch()` with an inline mutation string and manual TypeScript types. We considered three approaches:

- … `lambda-common/graphQlApiCalls.ts`); simple for a single mutation; full control over headers, retries, and error handling

We chose raw fetch because the crawler only calls one mutation, and this matches the established backend convention in content-monorepo's lambdas (corpus-scheduler-lambda, section-manager-lambda). Those lambdas use the same pattern: a `fetch()` POST with a `{ query, variables }` body, manual `Authorization` and `apollographql-client-name` headers, and hand-written TypeScript types. curation-admin-tools uses Apollo Client + codegen, but that pattern serves a React frontend with many queries/mutations and caching needs.

JWT library: jose
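Whichever library signs the token, the in-process caching described for the client (5-minute TTL, refresh at 95% of it) could be sketched as below. This is an illustration under assumptions, not the PR's code: signing is injected as a hypothetical `sign` callback so the sketch stays library-free, the clock is passed in for testability, and `getToken`, `cachedToken`, and `cachedAtMs` are invented names. Only `TOKEN_REFRESH_WINDOW_MS` and its 285 s value come from the PR.

```typescript
const TOKEN_TTL_MS = 5 * 60 * 1000; // 5-minute token lifetime
const TOKEN_REFRESH_WINDOW_MS = (TOKEN_TTL_MS * 95) / 100; // 285 000 ms, i.e. 285 s

// Hypothetical signer; in the real client this would be the RS256 JWT signing call.
type SignToken = () => Promise<string>;

// Module-level state, mirroring the "module-level let variables" decision.
let cachedToken: string | undefined;
let cachedAtMs = 0;

/** Returns the cached token, re-signing once the refresh window has elapsed. */
async function getToken(sign: SignToken, nowMs: number = Date.now()): Promise<string> {
  let token = cachedToken;
  if (token === undefined || nowMs - cachedAtMs >= TOKEN_REFRESH_WINDOW_MS) {
    token = await sign();
    cachedToken = token;
    cachedAtMs = nowMs;
  }
  return token;
}
```

Refreshing 15 seconds before expiry leaves room for a request that is already in flight to finish with a token that is still valid.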
The Corpus API client authenticates with RS256 JWTs signed from a JWK private key stored in Google Secret Manager. content-monorepo's `lambda-common` uses `jsonwebtoken` + `jwk-to-pem` for the same purpose (signing JWTs to call the Corpus Admin API from Lambda functions). curation-admin-tools does not sign JWTs; it receives OAuth2 ID tokens from Mozilla's auth provider and passes them as Bearer tokens.

Legend: 🟢 meets requirement · 🟡 meets with caveat · 🔴 does not meet.

- jose: `importJWK` takes the JWK directly.
- jsonwebtoken + jwk-to-pem: `require()` interop in ESM. JWK must first be converted to PEM via `jwk-to-pem`. Two packages for one signing flow.

We chose jose because this is a pure ESM monorepo targeting Node 24, and jose's native `importJWK` eliminates the JWK-to-PEM conversion step entirely. content-monorepo uses jsonwebtoken because it predates the ESM transition and runs in Lambda where CommonJS is the default; if those lambdas were being written today, jose would be the better fit there too.

Validation library evaluation (deferred)
This PR does not wire up runtime validation; the Pub/Sub input and BigQuery event types are plain TypeScript interfaces. The evaluation below was done during implementation and is retained for whichever follow-up PR wires validation to the Pub/Sub receive path and/or the `articles` publish path.

All candidates exceed the pipeline's ~10 validations/sec peak by 5+ orders of magnitude, so throughput was not a differentiator.

Legend: 🟢 meets requirement · 🟡 meets with caveat · 🔴 does not meet.

- Typia: … `ts-patch` compiler plugin; forward-compat risk with TypeScript's Go port (tsgo). Errors are readable but less structured than Zod's issue paths.

The benchmark validated real `articles` and `article_discoveries` event shapes from this pipeline; the x-axis is validations per second on a log scale, and the dashed line marks our ~10/sec peak requirement.

Zod looks like the best fit because it provides the cleanest schema-first decode pipeline for Pub/Sub boundaries (JSON string in, typed object out), has the strongest error reporting for diagnosing poison messages in production logs, and introduces zero toolchain complexity in a pure ESM monorepo. Typia's ts-patch requirement is a forward-compatibility risk as TypeScript moves toward its Go-based compiler, and its transform/coercion story is less natural for boundary decoding than Zod's schema-driven approach.
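To make the "JSON string in, typed object out" boundary concrete without committing to a library, here is a dependency-free sketch with hand-written checks. It is deliberately not Zod: a schema library would replace the manual checks and produce the issue paths automatically. The field names `url`, `corpus_item`, and `external_id`, the `CrawlArticleMessageSketch` type, and the issue-string format are all assumptions, since the PR names `CrawlArticleMessage` but does not list its fields.

```typescript
// Hypothetical message shape; the real CrawlArticleMessage fields are not shown in the PR.
interface CrawlArticleMessageSketch {
  url: string;
  corpus_item?: { external_id: string };
}

type DecodeResult<T> =
  | { ok: true; value: T }
  | { ok: false; issues: string[] };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Decodes a raw Pub/Sub payload, collecting path-prefixed issues instead of throwing. */
function decodeCrawlArticleMessage(raw: string): DecodeResult<CrawlArticleMessageSketch> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, issues: ["(root): payload is not valid JSON"] };
  }
  if (!isRecord(parsed)) {
    return { ok: false, issues: ["(root): expected a JSON object"] };
  }

  const issues: string[] = [];
  const url = parsed["url"];
  if (typeof url !== "string") issues.push("url: expected string");

  let externalId: string | undefined;
  const corpusItem = parsed["corpus_item"];
  if (corpusItem !== undefined) {
    if (!isRecord(corpusItem)) {
      issues.push("corpus_item: expected object");
    } else {
      const id = corpusItem["external_id"];
      if (typeof id === "string") externalId = id;
      else issues.push("corpus_item.external_id: expected string");
    }
  }

  if (typeof url !== "string" || issues.length > 0) {
    return { ok: false, issues };
  }
  return {
    ok: true,
    value: externalId === undefined ? { url } : { url, corpus_item: { external_id: externalId } },
  };
}
```

Returning a result object rather than throwing keeps poison messages easy to log and acknowledge at the receive path; the issue paths are the part that a schema-first library is expected to do better.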