KAFKA-20634: Spurious HighWatermarkUpdate failed errors in the group coordinator after partition leadership change #22444

Open
dajac wants to merge 1 commit into apache:trunk from dajac:worktree-KAFKA-20634

dajac (Member) commented Jun 1, 2026

When a `__consumer_offsets` partition transitions to follower, its local
log is truncated and re-replicated from the new leader. The group
coordinator hosting the partition remains active until it is unloaded
asynchronously. During that window, the partition's high watermark
advances again over records that this coordinator did not write, while
the coordinator still holds in-memory state (and pending deferred
operations) for its own records that were truncated and never durably
committed.

Applying such a high watermark has two consequences. It can violate the
invariants of the snapshot registry and fail the `HighWatermarkUpdate`
event, logging a spurious error such as "Execution of
HighWatermarkUpdate failed due to New committed offset X of
__consumer_offsets-N must be less than or equal to Y". More importantly,
when it does not fail, it advances the committed offset over the
coordinator's uncommitted state and completes the corresponding deferred
writes with a success response, even though those records were lost. A
client can therefore receive a successful offset-commit acknowledgment
for a commit that is silently dropped once the new coordinator takes
over.
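
The second consequence can be illustrated with a toy model. This is only a sketch under stated assumptions: `FalseAckSketch`, `ToyCoordinator`, and `DeferredWrite` are hypothetical names that do not mirror Kafka's classes, and the rule "complete a write once the high watermark reaches its end offset" is a simplification of how deferred writes are released.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model only: the names below are hypothetical and do not mirror Kafka's classes.
final class FalseAckSketch {

    /** A write whose response is deferred until its records are committed. */
    static final class DeferredWrite {
        final long endOffset;      // Offset just past the last record of this write.
        String result = "PENDING";

        DeferredWrite(long endOffset) {
            this.endOffset = endOffset;
        }
    }

    static final class ToyCoordinator {
        private final List<DeferredWrite> pending = new ArrayList<>();

        DeferredWrite write(long endOffset) {
            DeferredWrite w = new DeferredWrite(endOffset);
            pending.add(w);
            return w;
        }

        /** Completes every pending write whose end offset is at or below the high watermark. */
        void onHighWatermark(long highWatermark) {
            Iterator<DeferredWrite> it = pending.iterator();
            while (it.hasNext()) {
                DeferredWrite w = it.next();
                if (w.endOffset <= highWatermark) {
                    w.result = "SUCCESS";
                    it.remove();
                }
            }
        }
    }

    public static void main(String[] args) {
        ToyCoordinator coordinator = new ToyCoordinator();

        // The coordinator wrote records ending at offset 12 while it was leader.
        DeferredWrite lost = coordinator.write(12L);

        // Leadership moves and the local log is truncated below offset 12 (not modelled).
        // The new leader's records then push the high watermark to 15, and the
        // still-loaded coordinator applies it to its own pending write.
        coordinator.onHighWatermark(15L);

        System.out.println("truncated write completed with: " + lost.result);
    }
}
```

In this model the write is acknowledged with `SUCCESS` even though its records no longer exist in the log, which is the false acknowledgment described above.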

This patch gates high watermark propagation in
`CoordinatorPartitionWriter.ListenerAdapter` on the partition's
leadership. The adapter stops forwarding high watermark updates once the
partition transitions to follower, is deleted, or fails. The partition
signals these transitions (via `PartitionListener`) before its fetcher
is restarted (see `ReplicaManager#applyDelta`), i.e. before any such
high watermark can be produced, so the coordinator never observes a high
watermark that it should not apply. The pending deferred operations then
remain in place and are failed with `NOT_COORDINATOR` when the
coordinator is unloaded, so clients correctly retry against the new
coordinator.
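
The shape of the gate can be sketched roughly as follows. This is not the code in the patch: `LeadershipGateSketch`, `GatedAdapter`, `HighWatermarkListener`, and `RecordingListener` are hypothetical stand-ins, and the method names only echo the transitions listed above. The sketch assumes that the leadership signal always arrives before any unsafe high watermark, as the description states, so a plain flag is enough.

```java
// Hypothetical sketch; these types only mimic the roles described above and are
// not Kafka's actual classes or signatures.
final class LeadershipGateSketch {

    /** Downstream consumer of high watermark updates (stand-in for the coordinator). */
    interface HighWatermarkListener {
        void onHighWatermarkUpdated(long offset);
    }

    /** Adapter that forwards high watermark updates only until leadership is lost. */
    static final class GatedAdapter {
        private final HighWatermarkListener delegate;
        // Volatile because leadership changes and high watermark updates may
        // arrive on different threads (an assumption of this sketch).
        private volatile boolean active = true;

        GatedAdapter(HighWatermarkListener delegate) {
            this.delegate = delegate;
        }

        // Each transition named in the description closes the gate for good.
        void onBecomingFollower() { active = false; }
        void onDeleted() { active = false; }
        void onFailed() { active = false; }

        void onHighWatermarkUpdated(long offset) {
            if (!active) {
                return; // Dropped: the offset may cover records this coordinator did not write.
            }
            delegate.onHighWatermarkUpdated(offset);
        }
    }

    /** Records what reached the delegate so the behaviour can be inspected. */
    static final class RecordingListener implements HighWatermarkListener {
        long lastOffset = -1L;
        int updates = 0;

        @Override
        public void onHighWatermarkUpdated(long offset) {
            lastOffset = offset;
            updates++;
        }
    }

    public static void main(String[] args) {
        RecordingListener coordinator = new RecordingListener();
        GatedAdapter adapter = new GatedAdapter(coordinator);

        adapter.onHighWatermarkUpdated(10L); // Leader: forwarded.
        adapter.onBecomingFollower();        // Signalled before the fetcher restarts.
        adapter.onHighWatermarkUpdated(15L); // Follower: dropped.

        // Prints "updates=1 last=10".
        System.out.println("updates=" + coordinator.updates + " last=" + coordinator.lastOffset);
    }
}
```

The flag is never reset in this sketch, so updates stay blocked even if the same adapter were to see the broker regain leadership.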

Gating on leadership rather than inspecting the offset is deliberate:
after truncation an offset can still have a snapshot in the registry
while holding the new leader's data, so no offset-based check can tell
whether a high watermark is safe to apply.

Reviewers: Sean Quah <squah@confluent.io>


dajac force-pushed the worktree-KAFKA-20634 branch from 4eb586d to ee3a6be on June 1, 2026 19:02
dajac requested a review from squah-confluent on June 1, 2026 19:03

squah-confluent (Contributor) left a comment

Thanks for the patch!

I think the new API contract for `registerListener` is unusual. However, I don't have a better suggestion right now.

Comment on lines +37 to +39
* High watermark updates are delivered only while this broker is the leader of the
* partition. Once the partition is no longer led by this broker (it transitions to
* follower, is deleted or fails), no further updates are delivered. This guarantee

On the first pass, I read this as "updates are only delivered while the broker is the leader" instead of "updates stop forever once the broker is no longer the leader", probably because of the first sentence. Maybe we can make this clearer?

`no further updates are delivered, even if the broker regains leadership.`?

}

/**
* Register a {@link Listener}.

I think we should also call out the unexpected behavior in the `registerListener` javadoc.
