
fix(qwp): fix ingestion stalls after append failures and on sender startup #32

Merged: bluestreak01 merged 2 commits into main from vi_fix_send on May 24, 2026

bluestreak01 (Member) commented May 23, 2026

Refs questdb/questdb#7143

This PR bundles two QWP fixes for the upcoming 1.3.1 client release. Both are small, localized, and validated by their own regression tests.

Microbatch buffer release on append failure

  • QwpWebSocketSender.sealAndSwapBuffer() left the microbatch buffer in SENDING state when CursorSendEngine.appendBlocking() threw. No I/O thread ever recycles a buffer the engine never accepted, so the next flush would wait 30 s on the recycle latch and then throw "Timeout waiting for buffer to be recycled".
  • The catch block now puts the buffer back into a non-in-use state on the user thread before rethrowing: markRecycled() from SENDING, rollbackSealForRetry() from SEALED.
  • The encoded payload itself is dropped, but flushPendingRows() aborts its post-enqueue state updates after sealAndSwapBuffer() throws, so the source rows in tableBuffers and the sent-schema watermark stay intact. The next batch re-emits the same rows together with the full schema and symbol-dict delta — matching the existing "self-sufficient frames" invariant the cursor SF pipeline already relies on.
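
The release logic described above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the actual client code: the `Buffer`, `Engine`, and `BufferReleaseSketch` types are hypothetical stand-ins, and the assumption that rollbackSealForRetry() returns the buffer to FILLING is ours. Only the method names markRecycled(), rollbackSealForRetry(), and isInUse() and the state names come from this PR.

```java
// Hypothetical, simplified model of the buffer states named in this PR.
// The real MicrobatchBuffer and QwpWebSocketSender are more involved.
final class BufferReleaseSketch {
    enum State { FILLING, SEALED, SENDING, RECYCLED }

    static final class Buffer {
        State state = State.FILLING;

        void seal() { state = State.SEALED; }
        void markSending() { state = State.SENDING; }
        void markRecycled() { state = State.RECYCLED; }
        // Assumption: rolling back a seal returns the buffer to FILLING.
        void rollbackSealForRetry() { state = State.FILLING; }
        boolean isInUse() { return state == State.SEALED || state == State.SENDING; }
    }

    /** Stand-in for the engine; throws to simulate a failed append. */
    interface Engine {
        void appendBlocking(Buffer buffer);
    }

    static void sealAndSwapBuffer(Buffer buffer, Engine engine) {
        buffer.seal();
        try {
            engine.appendBlocking(buffer);
        } catch (RuntimeException e) {
            // Release on the calling thread: no I/O thread will recycle a
            // buffer the engine never accepted.
            if (buffer.state == State.SENDING) {
                buffer.markRecycled();
            } else if (buffer.state == State.SEALED) {
                buffer.rollbackSealForRetry();
            }
            throw e;
        }
    }

    /** Runs a failing append and reports whether the buffer is still in use afterwards. */
    static boolean inUseAfterFailure(boolean failAfterMarkSending) {
        Buffer buffer = new Buffer();
        Engine failing = b -> {
            if (failAfterMarkSending) {
                b.markSending(); // simulate a failure after the SENDING transition
            }
            throw new IllegalStateException("append failed");
        };
        try {
            sealAndSwapBuffer(buffer, failing);
        } catch (IllegalStateException expected) {
            // the sketch rethrows, so the caller still sees the original failure
        }
        return buffer.isInUse();
    }

    public static void main(String[] args) {
        System.out.println(inUseAfterFailure(true));  // failure from SENDING
        System.out.println(inUseAfterFailure(false)); // failure from SEALED
    }
}
```

In this model both failure paths leave the buffer in a state that isInUse() reports as false, so a later flush would not have to wait for a recycle that never comes.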

Segment manager wakeup on register

  • SegmentManager.register() now unparks the worker after publishing the new ring. Without the wakeup, if the worker thread acquires the manager lock before register() does, it snapshots an empty rings, services nothing, and parks for the full poll interval. A ring whose first append does not cross the high-water mark fires no producer-side wakeup either, so the spare never lands until the poll expires.
  • The race is pre-existing in the store-and-forward code, latent since the SF feature first landed. It surfaced on the parent-repo CI run only because the mac-other + JaCoCo combination consistently scheduled the worker first; the prior fix (commit 19c5c65) widened the test budget to 2 s but the poll interval is 5 s, so no budget below 5 s could rescue that ordering.
  • The fix is one call to wakeWorker() after setManagerWakeup(). LockSupport.unpark is cheap, is a no-op when the worker has not been started, and grants a permit that the next parkNanos consumes immediately — both interleavings are covered.
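
The permit behaviour this fix relies on can be illustrated with plain `java.util.concurrent.locks.LockSupport`. The sketch below is not the SegmentManager code; the class and method names are hypothetical, and the 5 s interval simply mirrors the poll interval mentioned above. It exercises both orderings: unpark before the park, and unpark while the thread is already parked.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Hypothetical demo class; only LockSupport itself is a real API here.
final class UnparkPermitSketch {

    /** Unpark first, then park: the stored permit makes parkNanos return at once. */
    static long unparkThenParkNanos(long pollNanos) {
        LockSupport.unpark(Thread.currentThread()); // grant the permit up front
        long start = System.nanoTime();
        LockSupport.parkNanos(pollNanos);           // consumes the permit
        return System.nanoTime() - start;
    }

    /** Park first, then unpark from another thread: the parked worker wakes early. */
    static long parkThenUnparkNanos(long pollNanos) {
        long[] elapsed = new long[1];
        Thread worker = new Thread(() -> {
            long start = System.nanoTime();
            LockSupport.parkNanos(pollNanos);
            elapsed[0] = System.nanoTime() - start;
        });
        worker.start();
        try {
            Thread.sleep(100);          // give the worker time to park
            LockSupport.unpark(worker); // the wakeup a register-style call would issue
            worker.join();              // join makes elapsed[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return elapsed[0];
    }

    public static void main(String[] args) {
        long poll = TimeUnit.SECONDS.toNanos(5);
        System.out.println("unpark-then-park ms: "
                + TimeUnit.NANOSECONDS.toMillis(unparkThenParkNanos(poll)));
        System.out.println("park-then-unpark ms: "
                + TimeUnit.NANOSECONDS.toMillis(parkThenUnparkNanos(poll)));
    }
}
```

In either ordering the parked thread should resume long before the 5 s interval elapses, which is the property the one-line wakeWorker() call depends on.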

Tradeoffs

  • With the microbatch-buffer fix, a close() after a failed flush retries the drain. If the underlying engine is still wedged (e.g. permanent backpressure, like the deliberately-undersized engine in the new test), that retry fails again and surfaces as a separate LineSenderException. The test acknowledges this and swallows the close-time rethrow. Real-world transient backpressure is expected to clear between the user's flush call and close(), in which case the drain succeeds.
  • The segment-manager fix bundles a second unrelated change under a title that names only the microbatch fix. The release-note line for 1.3.1 will undercount the actual scope. Mitigated by clearly separating the two fixes in this description and in the commit history (two separate commits).
  • No state-machine extension on either fix: MicrobatchBuffer.markRecycled() / rollbackSealForRetry() and SegmentManager.wakeWorker() already existed.

Test plan

  • New QwpWebSocketSenderTest.testFlushAppendFailureDoesNotLeaveMicrobatchBufferInUse (the reproducer from questdb/questdb#7143, "QWP WS: cursor append failure leaves microbatch buffer stuck in SENDING"). Fails on main with buffer0=SENDING, buffer1=FILLING and the 30 s recycle-timeout suppressed exception; passes in ~80 ms with this change.
  • New SegmentManagerTest.testRegisterAfterWorkerParkedWakesWorker. Sleeps 250 ms between start() and register() to guarantee the worker has parked on an empty rings snapshot, then asserts the spare lands within 2 s. Without the wakeWorker() call this test fails reliably; with it, all 9 SegmentManagerTest cases pass.
  • Full QwpWebSocketSenderTest and the broader QWP client suite still pass.

🤖 Generated with Claude Code

When CursorSendEngine.appendBlocking() throws inside
sealAndSwapBuffer(), the catch block now puts the sealed buffer back
into a state isInUse() reports as false: markRecycled() when it is
SENDING, rollbackSealForRetry() when it is still SEALED.

Without this, the buffer stayed in SENDING forever. No I/O thread ever
recycles a buffer the engine never accepted, so the next flush would
wait the 30 s recycle timeout and throw "Timeout waiting for buffer
to be recycled".

The encoded payload is dropped, but flushPendingRows bails out of its
post-enqueue state updates after sealAndSwapBuffer throws, so the
source rows and the sent-schema watermark stay intact and the next
batch re-emits the same rows along with the full schema and
symbol-dict delta.
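
A minimal sketch of why the rows survive, assuming a heavily simplified
sender: state is updated only after the enqueue call returns, so a throw
leaves both the pending rows and the schema flag untouched. The class,
the `Enqueuer` interface, and the "SCHEMA" marker are hypothetical
stand-ins for tableBuffers, the cursor engine, and the sent-schema
watermark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model; not the real flushPendingRows implementation.
final class RetainRowsSketch {
    /** Stand-in for handing an encoded frame to the send engine. */
    interface Enqueuer {
        void enqueue(List<String> frame);
    }

    private final List<String> pendingRows = new ArrayList<>();
    private boolean schemaSent; // stand-in for the sent-schema watermark

    void addRow(String row) {
        pendingRows.add(row);
    }

    void flushPendingRows(Enqueuer enqueuer) {
        List<String> frame = new ArrayList<>();
        if (!schemaSent) {
            frame.add("SCHEMA"); // frame carries the schema until it is known sent
        }
        frame.addAll(pendingRows);
        enqueuer.enqueue(frame); // may throw; nothing below runs in that case
        // Post-enqueue state updates happen only after a successful enqueue.
        schemaSent = true;
        pendingRows.clear();
    }

    /** Fails one flush, retries, and returns the frame delivered by the retry. */
    static List<String> retryAfterFailure() {
        RetainRowsSketch sender = new RetainRowsSketch();
        sender.addRow("row-1");
        sender.addRow("row-2");
        try {
            sender.flushPendingRows(frame -> {
                throw new IllegalStateException("append failed");
            });
        } catch (IllegalStateException expected) {
            // encoded frame is dropped; source rows and schema flag are intact
        }
        List<String> delivered = new ArrayList<>();
        sender.flushPendingRows(delivered::addAll);
        return delivered;
    }

    public static void main(String[] args) {
        System.out.println(retryAfterFailure());
    }
}
```

In this model the retry delivers the schema marker followed by both
original rows, mirroring the re-emit behaviour described above.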

Adds testFlushAppendFailureDoesNotLeaveMicrobatchBufferInUse, the
reproducer from questdb/questdb#7143. A memory-only CursorSendEngine
configured to fail every append lets the test confirm both microbatch
buffers leave the SENDING and SEALED states after a flush failure.

Refs questdb/questdb#7143

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SegmentManager.register() now unparks the worker thread after
publishing the new ring. Without this, register-after-start has a
race: start() schedules the worker, and if that thread reaches
workerLoop and takes `lock` before register() does, it snapshots an
empty `rings`, services nothing, and parks for the full poll
interval. A ring whose first append does not cross the high-water
mark fires no producer-side wakeup either, so the spare never lands
until the poll expires.

testFirstSpareLandsBeforeFirstPoll fails on CI under JaCoCo on the
mac-other runner whenever the worker wins the lock first; the prior
fix (commit 19c5c65) only widened the budget to 2s, but the poll
interval is 5s so no budget below 5s can rescue that ordering. The
LockSupport.unpark is cheap, no-ops when the worker has not been
started, and grants a permit that the next parkNanos consumes
immediately, so it covers both interleavings.

Adds testRegisterAfterWorkerParkedWakesWorker as a deterministic
regression test: sleeps 250ms between start() and register() so the
worker is guaranteed to have parked, then asserts the spare lands
within 2s. Without the wakeWorker() call in register() this test
fails reliably; with it, all 9 SegmentManagerTest cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mtopolnik (Contributor) commented

[PR Coverage check]

😍 pass : 3 / 5 (60.00%)

file detail

| path | covered line | new line | coverage |
|---|---|---|---|
| 🔵 io/questdb/client/cutlass/qwp/client/QwpWebSocketSender.java | 2 | 4 | 50.00% |
| 🔵 io/questdb/client/cutlass/qwp/client/sf/cursor/SegmentManager.java | 1 | 1 | 100.00% |

bluestreak01 changed the title from "fix(qwp): release microbatch buffer when cursor append fails" to "fix(qwp): fix ingestion stalls after append failures and on sender startup" on May 24, 2026
bluestreak01 merged commit 24f7591 into main on May 24, 2026
14 checks passed
bluestreak01 deleted the vi_fix_send branch on May 24, 2026 23:21

Labels

bug Something isn't working


2 participants