feat: add outbound webhooks with at-least-once delivery #325
edsonmichaque wants to merge 4 commits into main
This PR introduces a persistent outbound webhook system with at-least-once delivery guarantees. It allows external services to subscribe to notifications for platform events (e.g., LLM creation, app updates). The system is designed for reliability and security, featuring a persistent delivery queue, atomic event claiming, a 3-level retry policy, and per-subscription transport configuration. Security is addressed through HMAC-SHA256 signing of the timestamp and payload (which lets receivers reject replayed requests), SSRF protection, and encryption at rest for sensitive credentials.

Files Changed Analysis
The changes introduce a new, self-contained webhook feature, adding over 2,500 lines across 11 files. The implementation is well-structured:
Architecture & Impact Assessment
flowchart TD
A[Service Layer] -->|Publish| B[eventbridge.Bus]
B -->|SubscribeAll| C[WebhookService.HandleEvent]
C -->|Persist row| D["WebhookEvent table (status=pending)"]
C -->|Non-blocking push| E[queue chan uint]
E --> F[Worker goroutines]
G[DB Poller] -->|Poll due events| D
G -->|Push IDs| E
F -->|resolveRetryPolicy| P[Merge: sub > DB global > static]
P -->|SSRF check + HMAC sign| H[HTTP POST → external URL]
H -->|2xx| I["Persist success log<br>mark event delivered"]
H -->|failure| J["Persist failed log<br>schedule next attempt"]
J -->|attempt < maxAttempts| K["Update next_run_at<br>poller picks up"]
J -->|attempt == maxAttempts| L[Mark event exhausted]
M[Manual RetryDelivery API] -->|New WebhookEvent row| D
Scope Discovery & Context Expansion
The scope of this feature is defined by the list of … The broader impact is that any existing or future action that publishes one of these events to the …

Metadata
Powered by Visor from Probelabs. Last updated: 2026-02-28T11:49:08.403Z | Triggered by: pr_updated | Commit: 503a68d
Security Issues (2)
Architecture Issues (1)
Performance Issues (3)
Quality Issues (1)
Implements a persistent outbound webhook system for notifying external services of platform events (CRUD events for LLMs, apps, users, tools, etc.).

Key design decisions:
- DB-backed delivery queue (WebhookEvent) survives process restarts
- Atomic in_flight claim prevents double-delivery across workers
- 3-level retry policy: per-subscription > DB singleton > static config
- Normalized topic join table (WebhookTopic) instead of a JSON column
- Per-subscription transport config: proxy, TLS CA, mTLS, skip-verify
- Replay attack prevention: HMAC-SHA256 signs "timestamp.body"; receivers can reject stale requests outside their tolerance window
- SSRF protection at write time and delivery time; configurable via the ALLOW_INTERNAL_NETWORK_ACCESS env var (resolved at startup, not per call)
- Topic validation against KnownWebhookTopics, derived from system_events.go

REST endpoints (admin-only):
POST /api/v1/webhooks
GET /api/v1/webhooks
GET /api/v1/webhooks/topics
GET /api/v1/webhooks/config
PUT /api/v1/webhooks/config
GET /api/v1/webhooks/:id
PUT /api/v1/webhooks/:id
DELETE /api/v1/webhooks/:id
POST /api/v1/webhooks/:id/test
GET /api/v1/webhooks/:id/deliveries
POST /api/v1/webhooks/:id/deliveries/:log_id/retry
Force-pushed from 53c0e2b to dc02959
…ooks
- Extract actor user ID from gin context ("user" key) in RetryDelivery
handler; pass it through to service as actorID and store on
WebhookEvent.TriggeredBy (0 = system-initiated, non-zero = manual retry)
- Replace hardcoded limit=50 in ListDeliveries with getPaginationParams()
(?page, ?page_size, ?all), matching the pattern used across all other
list endpoints; set X-Total-Count and X-Total-Pages response headers
- Update ListDeliveryLogs service method to return (logs, totalCount,
totalPages, error) consistent with other paginated service methods
Add BeforeSave/AfterFind GORM hooks on WebhookSubscription to encrypt/decrypt sensitive fields using the platform's existing AES-256 + $ENC/ prefix convention (TYK_AI_SECRET_KEY env var):
- Secret (HMAC signing secret)
- TransportConfig.ProxyURL (may contain embedded credentials)
- TransportConfig.TLSCACert
- TransportConfig.TLSClientCert
- TransportConfig.TLSClientKey

Matches the pattern used by models.Submission. Encryption is a no-op when TYK_AI_SECRET_KEY is not configured (graceful degradation).
- RetryDelivery: pass subscriptionID from the URL to enforce ownership; the service now validates that log.SubscriptionID matches the URL param before re-enqueueing
- Error responses: replace err.Error() in 500 handlers with generic messages to avoid leaking internal details to callers
- Remove redundant GetWebhook existence checks from the Delete and ListDeliveries handlers (DeleteWebhook and ListDeliveryLogs handle missing records)
- Config hot path: add a 30-second in-memory cache with double-checked locking; invalidated immediately on UpdateWebhookConfig
- N+1 inserts: HandleEvent now collects all WebhookEvent rows into a slice and issues a single db.Create call
- Composite index on (status, next_run_at) for an efficient poller query
- LogRetentionDays in WebhookConfig with a daily pruning tick in the poller
- LogPruneInterval moved to WebhookServiceConfig (no hardcoded 24h ticker)
- Replace brittle fmt.Sprint slice comparison with reflect.DeepEqual in test
- Fix RetryDelivery test call sites for the updated 3-arg signature
Summary
Adds a persistent outbound webhook system that notifies external services when platform objects change (LLMs, apps, users, tools, filters, plugins, model routers, etc.).
Architecture
- WebhookEvent rows survive process restarts; the poller reschedules anything left undelivered
- UPDATE ... SET status='in_flight' WHERE status='pending' prevents two workers from racing to deliver the same event
- Retry policy resolves per-subscription → DB singleton (WebhookConfig) → static startup config; changing the DB config takes effect on the next delivery attempt without a restart
- WebhookTopic join table with a composite unique index instead of a JSON column; findMatchingSubscriptions uses a SQL JOIN; topics are validated against KnownWebhookTopics at write time
- Per-subscription proxy, TLS CA, mTLS and insecure_skip_verify; each delivery builds its own http.Client from the subscription's config
- HMAC-SHA256 signs "<unix_timestamp>.<body>" and the request carries both X-Tyk-Timestamp and X-Tyk-Signature: sha256=<hex> headers; receivers check that the timestamp is within their tolerance window (e.g. 5 minutes) before verifying the signature
- SSRF checks at write time and delivery time, configurable via the ALLOW_INTERNAL_NETWORK_ACCESS env var, resolved once at startup rather than per call
- RetryDelivery extracts the authenticated user from the gin context ("user" key, set by AuthMiddleware) and stamps WebhookEvent.TriggeredBy with the actor's user ID; 0 = system-initiated, non-zero = manually triggered by a specific admin
- GET .../deliveries uses getPaginationParams() (?page, ?page_size, ?all) and sets X-Total-Count / X-Total-Pages headers, consistent with all other list endpoints

Signature verification (receiver side)
REST endpoints (admin-only)
POST /api/v1/webhooks
GET /api/v1/webhooks
GET /api/v1/webhooks/topics
GET /api/v1/webhooks/config
PUT /api/v1/webhooks/config
GET /api/v1/webhooks/:id
PUT /api/v1/webhooks/:id
DELETE /api/v1/webhooks/:id
POST /api/v1/webhooks/:id/test
GET /api/v1/webhooks/:id/deliveries (paginated: ?page, ?page_size, ?all)
POST /api/v1/webhooks/:id/deliveries/:log_id/retry

New tables
- webhook_subscriptions: endpoint config + retry policy + transport config
- webhook_topics: subscription ↔ topic join table (normalized, composite unique index)
- webhook_events: persistent delivery queue (pending → in_flight → delivered/exhausted); triggered_by records the actor for manual retries
- webhook_delivery_logs: per-attempt audit log
- webhook_configs: DB singleton for runtime-tunable global defaults

Test plan
- CGO_ENABLED=1 go test ./services/... -run Webhook -v: all tests pass
- GET /api/v1/webhooks/topics returns the full known topic list
- … X-Tyk-Timestamp + X-Tyk-Signature headers
- … insecure_skip_verify: true on a subscription pointing at a self-signed HTTPS server
- … proxy_url and verify traffic routes through it
- … POST .../deliveries/:log_id/retry; confirm triggered_by is set to the authenticated admin's user ID
- … WebhookConfig via PUT /api/v1/webhooks/config; confirm the new retry policy takes effect without restart
- GET .../deliveries?page=1&page_size=10 returns paginated results with X-Total-Count and X-Total-Pages headers