KAFKA-20634: Spurious HighWatermarkUpdate failed errors in the group coordinator after partition leadership change#22444
Open
dajac wants to merge 1 commit into
Open
KAFKA-20634: Spurious HighWatermarkUpdate failed errors in the group coordinator after partition leadership change#22444dajac wants to merge 1 commit into
dajac wants to merge 1 commit into
Conversation
…coordinator after partition leadership change When a `__consumer_offsets` partition transitions to follower, its local log is truncated and re-replicated from the new leader. The group coordinator hosting the partition remains active until it is unloaded asynchronously. During that window, the partition's high watermark advances again over records that this coordinator did not write, while the coordinator still holds in-memory state (and pending deferred operations) for its own records that were truncated and never durably committed. Applying such a high watermark has two consequences. It can violate the invariants of the snapshot registry and fail the `HighWatermarkUpdate` event, logging a spurious error such as "Execution of HighWatermarkUpdate failed due to New committed offset X of __consumer_offsets-N must be less than or equal to Y". More importantly, when it does not fail, it advances the committed offset over the coordinator's uncommitted state and completes the corresponding deferred writes with a success response, even though those records were lost. A client can therefore receive a successful offset-commit acknowledgment for a commit that is silently dropped once the new coordinator takes over. This patch gates high watermark propagation in `CoordinatorPartitionWriter.ListenerAdapter` on the partition's leadership. The adapter stops forwarding high watermark updates once the partition transitions to follower, is deleted, or fails. The partition signals these transitions (via `PartitionListener`) before its fetcher is restarted (see `ReplicaManager#applyDelta`), i.e. before any such high watermark can be produced, so the coordinator never observes a high watermark that it should not apply. The pending deferred operations then remain in place and are failed with `NOT_COORDINATOR` when the coordinator is unloaded, so clients correctly retry against the new coordinator. Gating on leadership rather than inspecting the offset is deliberate: after truncation an offset can still have a snapshot in the registry while holding the new leader's data, so no offset-based check can tell whether a high watermark is safe to apply.
4eb586d to
ee3a6be
Compare
Contributor
squah-confluent
left a comment
There was a problem hiding this comment.
Thanks for the patch!
I think the new API contract for registerListener is unusual. However I don't have a better suggestion right now.
Comment on lines
+37
to
+39
| * High watermark updates are delivered only while this broker is the leader of the | ||
| * partition. Once the partition is no longer led by this broker (it transitions to | ||
| * follower, is deleted or fails), no further updates are delivered. This guarantee |
Contributor
There was a problem hiding this comment.
On the first pass, I read this as "updates are only delivered while the broker is the leader" instead of "updates stop forever once the broker is no longer the leader", probably because of the first sentence. Maybe we can make this clearer?
no further updates are delivered, even if the broker regains leadership.?
| } | ||
|
|
||
| /** | ||
| * Register a {@link Listener}. |
Contributor
There was a problem hiding this comment.
I think we should also call out the unexpected behavior on the registerListener javadoc.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a
__consumer_offsetspartition transitions to follower, its locallog is truncated and re-replicated from the new leader. The group
coordinator hosting the partition remains active until it is unloaded
asynchronously. During that window, the partition's high watermark
advances again over records that this coordinator did not write, while
the coordinator still holds in-memory state (and pending deferred
operations) for its own records that were truncated and never durably
committed.
Applying such a high watermark has two consequences. It can violate the
invariants of the snapshot registry and fail the
HighWatermarkUpdateevent, logging a spurious error such as "Execution of
HighWatermarkUpdate failed due to New committed offset X of
__consumer_offsets-N must be less than or equal to Y". More importantly,
when it does not fail, it advances the committed offset over the
coordinator's uncommitted state and completes the corresponding deferred
writes with a success response, even though those records were lost. A
client can therefore receive a successful offset-commit acknowledgment
for a commit that is silently dropped once the new coordinator takes
over.
This patch gates high watermark propagation in
CoordinatorPartitionWriter.ListenerAdapteron the partition'sleadership. The adapter stops forwarding high watermark updates once the
partition transitions to follower, is deleted, or fails. The partition
signals these transitions (via
PartitionListener) before its fetcheris restarted (see
ReplicaManager#applyDelta), i.e. before any suchhigh watermark can be produced, so the coordinator never observes a high
watermark that it should not apply. The pending deferred operations then
remain in place and are failed with
NOT_COORDINATORwhen thecoordinator is unloaded, so clients correctly retry against the new
coordinator.
Gating on leadership rather than inspecting the offset is deliberate:
after truncation an offset can still have a snapshot in the registry
while holding the new leader's data, so no offset-based check can tell
whether a high watermark is safe to apply.
Reviewers: Sean Quah squah@confluent.io