DurableEmitter: LOOP Plugin Support#2073
Conversation
✅ API Diff Results -
|
|
@mathewdgardner FYI |
SummaryThis PR enables
Stats: 6 files, +343 / -61 | 14 commits | Mergeable state: blocked Batching Logic Gaps (vs
|
| Factor | Status |
|---|---|
| Tests updated | ✅ Renames covered |
| New store code tested | ❓ No test file for beholderstore in this PR |
| gRPC size safety | ❌ Missing |
| Seqnum parity | ❌ Missing |
Recommendation: Address the gRPC size splitting and seqnum gaps before merge — both are correctness issues that could cause silent data loss or ordering violations in production.
Suggestion to consolidate batching logicDurableEmitter as DB-persistence layer + BatchEmitterService as transport layerCurrent design — DurableEmitter reimplements batching from scratch: Proposed design — DurableEmitter delegates transport to BatchEmitterService: Why it fits naturally
Tradeoffs to consider
Sketchtype DurableEmitter struct {
store DurableEventStore
batchEmitter interface {
EmitWithCallback(ctx context.Context, body []byte, cb func(error), attrKVs ...any) error
}
// retransmit, cfg, stopCh, wg, metrics — same as today
}
func (d *DurableEmitter) Emit(ctx context.Context, body []byte, attrKVs ...any) error {
payload := serialize(body, attrKVs...)
id, err := d.store.Insert(ctx, payload)
if err != nil {
return err
}
// Fire-and-forget through batch emitter; callback marks delivered
_ = d.batchEmitter.EmitWithCallback(ctx, body, func(sendErr error) {
if sendErr == nil {
d.store.MarkDelivered(context.Background(), id)
}
}, attrKVs...)
return nil
}Bottom line: DurableEmitter currently duplicates ~300 lines of batching logic that |
pkcll
left a comment
There was a problem hiding this comment.
@DylanTinianov, Batching logic gaps DurableEmitter is missing gRPC request size splitting (risks exceeding max message size) and seqnum stamping (breaks ordering guarantees for downstream consumers). Also flagged: no shutdown timeout in Close() and unbounded MarkDelivered goroutines. Please address before merge.
|
Proposal: Compose with BatchEmitterService @DylanTinianov, @patrickhuie19, @jmank88, would appreciate your thoughts on feasibility. |
|
@pkcll Thanks for the comments! Added tickets for these to our EPIC. Will discuss with @patrickhuie19 and @tarcisiozf to scope this out. This PR is still needed for LOOP support, regardless of the implementation changes. Also |
|
@DylanTinianov, this one needs to be addressed for sure before merging Critical: No gRPC Request Size Splitting |
|
Also @jmank88 let's discuss where the setup code should live. Some concerns from @mathewdgardner :
We could revert back to the helper function I had initially? Although I agree the implementation is cleaner on the client. |
| if s.EnvConfig.ChipIngressDurableEmitterEnabled { | ||
| if s.DataSource == nil { | ||
| return fmt.Errorf("durable emitter requires a database connection: set CL_DATABASE_URL") | ||
| } | ||
| err = beholderClient.SetupDurableEmitter(beholderstore.New(s.DataSource), false) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| err = beholderClient.StartDurableEmitter(ctx) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| } |
There was a problem hiding this comment.
pkg/loop/server.go line 359 beholderClient.StartDurableEmitter(ctx)
DurableEmitter has a manual Start(ctx) / Close() lifecycle that doesn't implement services.Service. This means:
- No health reporting it can't be registered with
s.checker, so a stuck retransmit or publish loop is invisible to health checks (have not done for ChipIngressBatchEmitterService yet) - Fragile ordering
StartDurableEmitteris called beforebeholderClient.Start(ctx), but shutdown relies onbeholderClient.Close()transitively closing it. This implicit coupling is easy to break. - No idempotency / state guards
services.Service(viaservices.Engine) gives youIfStarted,IfNotStopped, ready/healthy state for free.
Consider having DurableEmitter implement services.Service (like ChipIngressBatchEmitterService already does)
This would also let you remove SetupDurableEmitter + StartDurableEmitter as separate methods on Client just expose the service and let the caller manage its lifecycle uniformly.
| ChipIngressBatchEmitterEnabled bool | ||
| ChipIngressEndpoint string | ||
| ChipIngressInsecureConnection bool | ||
| ChipIngressBatchEmitterEnabled bool |
There was a problem hiding this comment.
Need to add guard or mutual exclusion for these two ChipIngressBatchEmitterEnabled, ChipIngressDurableEmitterEnabled
If both are enabled it should either error out or document precedence.
They conflict silently. Here's what happens:
-
NewClientseesChipIngressBatchEmitterEnabled=true→ createsChipIngressBatchEmitterServiceas the chip emitter, wraps it inDualSourceEmitter, assigns toclient.Emitter. -
Then
SetupDurableEmitter(from this PR) overwritesclient.Emitterentirely:
c.Emitter = dualEmitter // replaces the previously-assigned DualSourceEmitter
c.durableEmitter = durableEmitterSo with both enabled:
- The
BatchEmitterServiceis still created and started as a sub-service (lifecycle managed by the engine) but nothing emits to it — it's orphaned. DurableEmitteruses its own internal batching against the rawchipingress.Client, bypassing the batch emitter entirely.
There was a problem hiding this comment.
Sounds good, created a ticket for this as well.
mathewdgardner
left a comment
There was a problem hiding this comment.
@DylanTinianov the beholder client is for telemetry which is best effort and isn't durable by design. For the use cases where we want message delivery guarantees like some valuable events, we should use the chip-ingress gRPC client directly and outside of the beholder pkg. This also means that the chip-ingress client is a dependency for your new event emitter.
We should keep telemetry use cases separate from the durable event use cases.
| cfg DurableEmitterConfig | ||
| log logger.Logger | ||
|
|
||
| metrics *durableEmitterMetrics |
There was a problem hiding this comment.
Observability Comparison: DurableEmitter vs BatchEmitterService + batch.Client
Metrics Coverage
| Concern | BatchEmitterService + batch.Client | DurableEmitter |
|---|---|---|
| Events sent (success) | ✅ chip_ingress.events_sent per domain/entity |
✅ beholder.durable_emitter.emit.success (count only, no domain/entity breakdown) |
| Events dropped/failed | ✅ chip_ingress.events_dropped per domain/entity |
✅ beholder.durable_emitter.emit.failure (no domain/entity) |
| Batch send requests | ✅ chip_ingress.batch.send_requests_total with status=success/failure |
❌ No per-RPC request counter |
| Request size (messages) | ✅ chip_ingress.batch.request_size_messages histogram with max_batch_size attr |
❌ Not tracked |
| Request size (bytes) | ✅ chip_ingress.batch.request_size_bytes histogram with max_grpc_request_size attr |
❌ Not tracked |
| Request latency | ✅ chip_ingress.batch.request_latency_ms histogram with status |
OnBatchPublish) — not exported as a metric instrument |
| Config info gauge | ✅ chip_ingress.batch.config.info with all config attrs |
❌ Not tracked |
| Queue depth | ❌ Not tracked (fire-and-forget buffer) | ✅ queueDepth / queueDepthMax gauges |
| Emit latency (caller-facing) | ❌ Not tracked | ✅ emit.duration + emit.total_duration histograms |
| DB store operations | N/A (no DB) | ✅ Instrumented store wrapper (insert/mark/purge durations) |
| Near-TTL / expiry | N/A | ✅ expiredPurged, NearTTLCount via ObserveDurableQueue |
| Publish batch events OK/err | Implicit via sent/dropped counters | ✅ publishBatchEvOK / publishBatchEvErr |
Key Gaps in DurableEmitter
-
No per-domain/entity attribution — BatchEmitterService tags every metric with
domain+entityviametricAttrsFor(). DurableEmitter metrics are flat counters with no cardinality, making it impossible to identify which event source is failing. -
No request size observability — batch.Client records both message count and byte size histograms per send. DurableEmitter has no visibility into whether batches are approaching gRPC limits (which ties back to the missing size-splitting gap).
-
No send request counter with status — batch.Client's
send_requests_totalwith success/failure lets you compute error rate. DurableEmitter only logs failures; no metric-based alerting possible on publish RPC error rate. -
No config info metric — batch.Client emits a gauge with all configuration parameters (batch size, buffer size, timeouts, etc.) for runtime introspection. DurableEmitter has no equivalent — you can't verify running config from metrics alone.
-
Latency only via hooks, not instruments —
OnBatchPublishcallback provides latency to test code but doesn't register an OTel histogram. You can't dashboard or alert on publish latency without custom wiring.
Where DurableEmitter is Better
- Queue depth visibility — critical for a persistence-backed system. You can alert on backlog growth.
- Emit-path latency — tracks how long the caller blocks on DB insert, useful for detecting DB pressure.
- Store-layer instrumentation — the metrics-instrumented store wrapper gives per-operation DB latency.
Logging
| Concern | batch.Client | DurableEmitter |
|---|---|---|
| Publish failure | ✅ Errorw("failed to publish batch") |
✅ Warnw("PublishBatch failed, events will be retransmitted") |
| Buffer full | Silent (returns error to caller) | ✅ Warnw("batch publish channel full, relying on retransmit") |
| Retransmit activity | N/A | ✅ Logs enqueue counts, skipped, pending depth |
| Shutdown timeout | ✅ Warnw("timed out waiting for shutdown") |
❌ No timeout exists |
| Config at startup | Via gauge metric only | Via Infow log lines (coalescing, raw-codec) |
Summary
BatchEmitterService has better transport-layer observability (request sizes, latency histograms, per-domain attribution, config gauges). DurableEmitter has better persistence-layer observability (queue depth, DB operation metrics, emit latency). If DurableEmitter composed over BatchEmitterService, you'd get both layers covered without duplication.
Sets up DurableEmitter for LOOP Plugins in the Server and implements the following:
isHostProcesstoretransmit