refactor: Nack Messages by default to avoid acking failed messages by yhl25 · Pull Request #3342 · numaproj/numaflow

yhl25 · 2026-03-31T18:39:20Z

Previously, a message could be accidentally acked if a panic or error occurred mid-processing. The default behavior on drop was to ack, so any message that was not explicitly marked as failed could be silently acked and lost.

This PR introduces a MessageHandle type that wraps messages and tracks acknowledgment state throughout the pipeline. The default behavior is now to nack, and a message is only acked after it has been fully processed and all downstream messages have been successfully written. Explicit mark_success() and mark_failed() calls replace the previous manual isFailed flag manipulation, making the ack/nack logic clearer and safer.

This ensures that in panic or error scenarios, messages are correctly nacked and retried rather than being silently lost.
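The nack-by-default behavior described above can be sketched roughly as follows. This is a minimal illustration built only on the standard library, not the code from this PR: the field layout, the `std::sync::mpsc` channel, the hypothetical `run_pipeline` helper, and the choice to have `mark_success` consume the handle are all assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
enum ReadAck {
    Ack,
    Nak,
}

/// Shared state. The verdict is sent once, when the last handle is dropped.
struct AckHandle {
    /// Number of handles that have not yet been marked as success.
    ref_count: AtomicUsize,
    ack_tx: Sender<ReadAck>,
}

impl Drop for AckHandle {
    fn drop(&mut self) {
        // Default is NAK: only a count of exactly zero produces an ACK.
        let verdict = if self.ref_count.load(Ordering::Acquire) == 0 {
            ReadAck::Ack
        } else {
            ReadAck::Nak
        };
        // The receiver may already be gone; ignoring the error avoids a panic in drop.
        let _ = self.ack_tx.send(verdict);
    }
}

/// Hypothetical wrapper pairing a payload with the shared ack state.
struct MessageHandle {
    payload: String,
    ack_handle: Arc<AckHandle>,
}

impl MessageHandle {
    fn new(payload: &str) -> (Self, Receiver<ReadAck>) {
        let (ack_tx, ack_rx) = channel();
        let ack_handle = Arc::new(AckHandle {
            ref_count: AtomicUsize::new(1),
            ack_tx,
        });
        (Self { payload: payload.to_string(), ack_handle }, ack_rx)
    }

    fn message(&self) -> &str {
        &self.payload
    }

    /// Consumes the handle so it cannot be marked twice (an assumption of this sketch).
    fn mark_success(self) {
        self.ack_handle.ref_count.fetch_sub(1, Ordering::AcqRel);
    }
}

impl Clone for MessageHandle {
    fn clone(&self) -> Self {
        self.ack_handle.ref_count.fetch_add(1, Ordering::Relaxed);
        Self {
            payload: self.payload.clone(),
            ack_handle: Arc::clone(&self.ack_handle),
        }
    }
}

/// Hypothetical driver: one upstream handle plus one downstream clone.
fn run_pipeline(all_succeed: bool) -> ReadAck {
    let (handle, ack_rx) = MessageHandle::new("payload");
    let downstream = handle.clone();
    handle.mark_success();
    if all_succeed {
        downstream.mark_success();
    } else {
        // Simulates an early return or panic unwinding: dropped without mark_success.
        drop(downstream);
    }
    ack_rx.recv().expect("verdict is sent when the last handle drops")
}
```

In this sketch `run_pipeline(true)` yields an ack and `run_pipeline(false)` yields a nak, because a handle that is dropped without being marked leaves the count above zero.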

Signed-off-by: Yashash <yashashhl25@gmail.com>

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

codecov · 2026-03-31T21:54:28Z

Codecov Report

❌ Patch coverage is 87.36702% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.57%. Comparing base (06440fc) to head (3d0d209).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/numaflow-core/src/message.rs	77.11%	46 Missing ⚠️
rust/numaflow-core/src/monovertex/bypass_router.rs	82.19%	13 Missing ⚠️
rust/numaflow-core/src/watermark/source.rs	75.55%	11 Missing ⚠️
rust/numaflow-core/src/source.rs	80.55%	7 Missing ⚠️
rust/numaflow-core/src/sinker/sink.rs	89.09%	6 Missing ⚠️
rust/numaflow-core/src/pipeline/isb/writer.rs	92.42%	5 Missing ⚠️
rust/numaflow-core/src/mapper/map.rs	94.54%	3 Missing ⚠️
rust/numaflow-core/src/pipeline/isb/reader.rs	92.50%	3 Missing ⚠️
...ust/numaflow-core/src/reduce/wal/segment/append.rs	97.56%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3342      +/-   ##
==========================================
+ Coverage   82.55%   82.57%   +0.01%     
==========================================
  Files         306      306              
  Lines       74445    74510      +65     
==========================================
+ Hits        61460    61528      +68     
  Misses      12427    12427              
+ Partials      558      555       -3


vaibhavtiwari33 · 2026-04-05T23:52:38Z

+    fn clone(&self) -> Self {
+        self.ack_handle
+            .ref_count
+            .fetch_add(1, std::sync::atomic::Ordering::SeqCst);


Why do we need sequential consistency for this? Cloning would get expensive, especially since this happens for each message.
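For reference, the standard library's `Arc::clone` increments its strong count with `Ordering::Relaxed`, since a new reference can only be created from an existing live one. A minimal sketch of the same idea applied to a clone counter is below. The `AckHandle` and `MessageHandle` types here are simplified stand-ins, and `count_after_concurrent_clones` is a hypothetical helper; whether Relaxed is appropriate for the real code depends on how the count is read at drop time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Simplified stand-in: only the counter is modeled.
struct AckHandle {
    ref_count: AtomicUsize,
}

struct MessageHandle {
    ack_handle: Arc<AckHandle>,
}

impl Clone for MessageHandle {
    fn clone(&self) -> Self {
        // The increment is still atomic, so no update is lost. Relaxed only
        // drops ordering guarantees relative to other memory operations.
        self.ack_handle.ref_count.fetch_add(1, Ordering::Relaxed);
        Self { ack_handle: Arc::clone(&self.ack_handle) }
    }
}

/// Clones the handle from several threads and returns the final count.
fn count_after_concurrent_clones(threads: usize, clones_per_thread: usize) -> usize {
    let root = MessageHandle {
        ack_handle: Arc::new(AckHandle { ref_count: AtomicUsize::new(1) }),
    };
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let local = root.clone();
            thread::spawn(move || {
                let clones: Vec<MessageHandle> =
                    (0..clones_per_thread).map(|_| local.clone()).collect();
                clones.len()
            })
        })
        .collect();
    for worker in workers {
        worker.join().expect("worker panicked");
    }
    // join() establishes happens-before, so this load observes every increment.
    root.ack_handle.ref_count.load(Ordering::Relaxed)
}
```

With 4 threads and 100 clones each, the count ends at 1 + 4 + 400 = 405 regardless of interleaving.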

vaibhavtiwari33 · 2026-04-05T23:56:44Z

-            if self.is_failed.load(std::sync::atomic::Ordering::Relaxed) {
-                ack_handle.send(ReadAck::Nak).expect("Failed to send nak");
+            // NAK if ref_count is not 0 (meaning not all references were marked as success)
+            if self.ref_count.load(std::sync::atomic::Ordering::SeqCst) != 0 {


Similarly, IMO we don't need sequential consistency here. Wrapping AckHandle in Arc guarantees that at the time of drop, no other thread can still hold this AckHandle and increase the value of ref_count.

vaibhavtiwari33 · 2026-04-06T00:09:26Z

+    pub(crate) fn mark_success(&self) {
+        self.ack_handle
+            .ref_count
+            .fetch_sub(1, std::sync::atomic::Ordering::SeqCst);


Ok, I see why we're probably using SeqCst, to try to avoid scenarios where ref_count becomes negative. But I think, at the end of the day, we still can get away with Relaxed ordering since AckHandle is wrapped in Arc which already uses Release/Acquire, so the final count during Drop will be correct.

Also, similar to how there can be bugs because of missing out on marking a message as success, we can also create bugs by marking a message as success too many times (ref_count < 0). Should we limit the ref_count to only be decreased till 0?
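One way to cap the decrement at zero is `AtomicUsize::fetch_update` with `checked_sub`. The sketch below is an illustration of that suggestion, not the change made in this PR, and `decrement_saturating` is a hypothetical helper name. Note that with an `AtomicUsize` a plain `fetch_sub` on zero does not go negative; it wraps around to `usize::MAX`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Decrements `ref_count` by one, but never below zero.
/// Returns `false` when the count was already zero, which signals an
/// extra mark_success call that the caller may want to log.
fn decrement_saturating(ref_count: &AtomicUsize) -> bool {
    ref_count
        // The closure returns None at zero, which makes fetch_update
        // leave the value untouched and return Err.
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
        .is_ok()
}
```

The return value lets the caller distinguish a normal decrement from a redundant one without changing the stored count.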

vaibhavtiwari33 · 2026-04-06T00:40:33Z

-                    message.ack_handle = Some(Arc::new(AckHandle::new(resp_ack_tx)));
+                    let ack_handle = Arc::new(AckHandle::new(resp_ack_tx));

                    // insert the offset and the ack one shot in the tracker.


Suggested change

-                    // insert the offset and the ack one shot in the tracker.
+                    // insert the message (with offset) into the tracker

vaibhavtiwari33 · 2026-04-06T03:03:59Z

-                ack_handle.send(ReadAck::Ack).expect("Failed to send ack");
+                let _ = ack_handle.send(ReadAck::Ack);


I agree that we shouldn't panic if sending the ack/nack fails, but since the tokio tasks spawned for listening to these ack/nacks aren't structured, we should have a warn log here to notify that the receiver task exited before receiving the ack/nack.
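A rough sketch of the suggestion above: handle the send error instead of discarding it, and log a warning. This example uses `std::sync::mpsc` and `eprintln!` purely to stay self-contained; the actual code uses a different channel and would presumably use its own logging macro. `send_verdict` is a hypothetical helper name.

```rust
use std::sync::mpsc::Sender;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ReadAck {
    Ack,
    Nak,
}

/// Sends the verdict without panicking.
/// Returns `false`, after logging, if the receiving side is already gone.
fn send_verdict(ack_tx: &Sender<ReadAck>, verdict: ReadAck) -> bool {
    match ack_tx.send(verdict) {
        Ok(()) => true,
        Err(err) => {
            // Stand-in for a warn-level log; err.0 is the unsent value.
            eprintln!("WARN: ack receiver exited before receiving {:?}", err.0);
            false
        }
    }
}
```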

vaibhavtiwari33 · 2026-04-06T03:30:05Z

+    /// Permit to achieve structured concurrency by ensuring we do not exceed the concurrency limit
+    /// and all the tasks are cleaned up when the component is shutting down.
+    permit: OwnedSemaphorePermit,
+    read_message: MessageHandle,


nit:
msg_handle: MessageHandle

vaibhavtiwari33 · 2026-04-06T03:31:54Z

    /// and all the tasks are cleaned up when the component is shutting down.
    pub permit: OwnedSemaphorePermit,
-    pub message: Message,
+    pub read_message: MessageHandle,


nit:
msg_handle: MessageHandle

vaibhavtiwari33 · 2026-04-06T13:57:09Z

+                        // Downstream messages are independent - they use a no-op AckHandle.
+                        // The original read_message is ACK'd via mark_success! below.
+                        let read_msg: MessageHandle = mapped_message.into();


Why a no-op AckHandle? Shouldn't the final ack/nack for the message only be triggered when downstream (isb writer/sink) finishes processing?

vaibhavtiwari33 · 2026-04-06T14:05:49Z

-                        let bypassed = if let Some(ref bypass_router) = self.bypass_router {
-                            bypass_router
-                                .try_bypass(mapped_message.clone())
+                        let read_msg: MessageHandle = mapped_message.into();


Same here, why independent ackHandles?

vaibhavtiwari33 · 2026-04-15T21:50:24Z

+pub(crate) struct SourceWatermarkEntry {
+    pub(crate) partition_id: u16,
+    pub(crate) event_time_ms: i64,
+}
+
+impl From<&MessageHandle> for SourceWatermarkEntry {
+    fn from(handle: &MessageHandle) -> Self {
+        let msg = handle.message();
+        let partition_id = match &msg.offset {
+            Offset::Int(o) => o.partition_idx,
+            Offset::String(o) => o.partition_idx,
+        };
+        Self {
+            partition_id,
+            event_time_ms: msg.event_time.timestamp_millis(),
+        }
+    }
+}


I don't think we need to introduce this tbh. We're abstracting away the partition idx extraction logic which is only being used once.

Let's remove this.

Keeping SourceWatermarkEntry so the watermark code doesn't need to know about MessageHandle. Without it we'd pass &[(u16, i64)] which is unclear. From is the standard way to do conversions in Rust.

I think without it, we were directly passing the Message around earlier, but fair that we want to limit the scope of info available for watermark code.

vaibhavtiwari33 · 2026-04-15T21:57:31Z

+    /// ref_count is not decremented, so the message will be NAK'd when the AckHandle is dropped.
+    /// The error is logged at NAK time.
+    pub(crate) fn mark_failed(self, reason: impl fmt::Display) {
+        *self.ack_handle.failure_reason.lock().unwrap() = Some(reason.to_string());


Failure reasons would be overwritten every time we call mark_failed for the same ack handle.

Also, can we remove unwrap here?

vaibhavtiwari33 · 2026-04-15T22:04:54Z

+    /// mark_success is called. On drop: NAK if ref_count != 0, ACK if ref_count == 0.
+    ref_count: AtomicUsize,
+    /// Set by mark_failed to record why the message is being nacked.
+    failure_reason: Mutex<Option<String>>,


I'm not sure about the Mutex here.
Our final goal is only to capture the error message, which should not require locking overhead.

I see that we're overwriting the error message every time anyway. Let's use OnceLock to capture only the first mark_failed message. This will avoid Mutex usage.
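A minimal sketch of the OnceLock suggestion, assuming a simplified `AckHandle` that holds only the failure reason. The method names `record_failure` and `failure_reason` are hypothetical. `OnceLock::set` returns `Err` when a value is already present, so the first reason wins and there is no poisoned-lock case to unwrap.

```rust
use std::fmt;
use std::sync::OnceLock;

/// Simplified stand-in: only the failure reason is modeled.
struct AckHandle {
    failure_reason: OnceLock<String>,
}

impl AckHandle {
    fn new() -> Self {
        Self { failure_reason: OnceLock::new() }
    }

    /// Records the reason only if none has been recorded yet.
    /// Returns `true` when this call stored the reason.
    fn record_failure(&self, reason: impl fmt::Display) -> bool {
        self.failure_reason.set(reason.to_string()).is_ok()
    }

    /// The first recorded reason, if any.
    fn failure_reason(&self) -> Option<&str> {
        self.failure_reason.get().map(String::as_str)
    }
}
```

This version still formats the reason into a `String` on every call; checking `get()` first would skip that allocation once a reason has been stored.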

vaibhavtiwari33 · 2026-04-16T19:31:51Z

@@ -519,23 +518,21 @@ impl<C: crate::typ::NumaflowTypeConfig> Source<C> {

                let mut ack_handles = vec![];


Can we change this to msg_handles instead, since that is what we're tracking in this vector? This will avoid confusion with the actual ack_handle.

vaibhavtiwari33 · 2026-04-17T21:56:47Z

                    let mut fallback_messages: Vec<Message> = vec![];
                    let mut on_success_messages: Vec<Message> = vec![];
-                    let mut ack_handles = vec![];
+                    let mut read_messages: Vec<MessageHandle> = vec![];


nit: rename to msg_handles

vaibhavtiwari33 · 2026-04-19T19:47:49Z

    pub mapper: UserDefinedBatchMap,
-    pub batch: Vec<Message>,
-    pub output_tx: mpsc::Sender<Message>,
+    pub read_batch: Vec<MessageHandle>,


nit: rename read_batch

yhl25 added 6 commits February 9, 2026 11:10

read message

e0cfa4c

Signed-off-by: Yashash <yashashhl25@gmail.com>

ack handle should not be optional

15d0faa

Signed-off-by: Yashash <yashashhl25@gmail.com>

chore: resolve conflicts

3eb3530

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

chore: fix tests

6253635

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

Merge branch 'main' of github.com:numaproj/numaflow into message-handle

6580fe9

chore: resolve conflicts

597ce48

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

yhl25 requested a review from vaibhavtiwari33 March 31, 2026 18:39

yhl25 added 3 commits March 31, 2026 12:10

chore: fix uts and lint

634bef7

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

chore: fix test

65de7b2

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

chore: fix lint

bea91c7

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

Merge branch 'main' into message-handle

1e67939

vaibhavtiwari33 reviewed Apr 6, 2026

yhl25 added 6 commits April 6, 2026 17:01

address review comments

6b7cb62

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

Merge branch 'message-handle' of github.com:numaproj/numaflow into me…

7a67551

…ssage-handle

chore: mark failed to capture error

16743ce

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

chore: fix ut

eaf6bbf

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

chore: simplify sink

34499e8

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

use macro

e456bdb

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

vaibhavtiwari33 reviewed Apr 16, 2026

yhl25 added 3 commits April 20, 2026 09:48

chore: review comments

e648bbc

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

Merge remote-tracking branch 'origin/main' into message-handle

c99ec44

# Conflicts: # rust/numaflow-core/src/mapper/map/batch.rs # rust/numaflow-core/src/reduce/pbq.rs # rust/numaflow-core/src/reduce/reducer/aligned/user_defined.rs

chore: resolve conflicts

40a7266

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

yhl25 requested a review from vaibhavtiwari33 April 20, 2026 18:13

vaibhavtiwari33 approved these changes Apr 20, 2026

yhl25 added 2 commits April 20, 2026 17:53

chore: review comments

ac71d18

Signed-off-by: Yashash Lokesh <yashashhl25@gmail.com>

Merge branch 'main' into message-handle

3d0d209

yhl25 marked this pull request as ready for review April 21, 2026 15:54

yhl25 requested review from vigith and whynowy as code owners April 21, 2026 15:54

yhl25 merged commit 4ffd0b1 into main Apr 21, 2026
27 checks passed

yhl25 deleted the message-handle branch April 21, 2026 22:01
