Improve visibility into events routed to the DLQ. #1253

Open
mashhurs wants to merge 6 commits into logstash-plugins:main from mashhurs:dlq-improvements

Conversation

Contributor

@mashhurs mashhurs commented Mar 24, 2026

This PR addresses two visibility gaps:

  • After a bulk request, we don't know how many events failed and are going to the DLQ.
  • More importantly, we don't know which error the plugin received from ES, which may need special attention (for example, mapping exceptions).

With this PR, I have added minimal logic to log

  • the number of events sent to the DLQ
  • one sample event (the first) for each status (400 & 404)
    from the failed bulk request. Note that this may still produce a lot of noise if each bulk request contains many 400/404 rejected events.
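
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `summarize_dlq_items`, the `DLQ_CODES` constant, and the shape of the `items` hashes are all assumptions made for the example.

```ruby
# Hypothetical helper, not the plugin's actual code: groups DLQ-bound bulk
# items by HTTP status, keeping a count and the first sample per status.
DLQ_CODES = [400, 404].freeze # assumed DLQ-eligible statuses, per the description

def summarize_dlq_items(items)
  stats = {}
  items.each do |item|
    status = item[:status]
    next unless DLQ_CODES.include?(status)

    entry = (stats[status] ||= { count: 0, sample_event: nil })
    entry[:count] += 1
    # Only the first item seen for a status is kept as the sample.
    entry[:sample_event] ||= { action: item[:action], response: item[:response] }
  end
  { count: stats.values.sum { |e| e[:count] }, dlq_routed_stats: stats }
end

items = [
  { status: 404, action: ["update", { _id: "a" }], response: { "status" => 404 } },
  { status: 404, action: ["update", { _id: "b" }], response: { "status" => 404 } },
  { status: 400, action: ["index", { _id: "c" }], response: { "status" => 400 } },
  { status: 201, action: ["index", { _id: "d" }], response: { "status" => 201 } }
]
summary = summarize_dlq_items(items)
```

Keeping a single sample per status bounds the size of the log line by the number of distinct DLQ statuses rather than by the number of rejected events.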

Sample log from tests:

[2026-03-25T14:25:42,458][ERROR][org.logstash.common.io.DeadLetterQueueWriter][main][dcd2523f8211587168a7a70631cc7d39db3fd6c27afd716544d8dac593c851b9] Cannot write event to DLQ(path: /Users/mashhur/Dev/elastic/logstash/data/dead_letter_queue/main): reached maxQueueSize of 5242880
[2026-03-25T14:25:42,458][WARN ][logstash.outputs.elasticsearch][main][dcd2523f8211587168a7a70631cc7d39db3fd6c27afd716544d8dac593c851b9] Events could not be indexed and routing to DLQ {count: 125, dlq_reouted_stats: {404 => {count: 125, sample_event: {action: ["update", {_id: "nonexistent_id_12345", _index: "test-dlq", routing: nil, retry_on_conflict: 1}, {"event" => {"data" => {"User-Name" => "mashhur"}, "original" => "{\"fileset\":{\"module\":\"system\",\"name\":\"asa\"},\"system\":{\"auth\":{\"timestamp\":\"May 17  05:17:00\",\"ssh\":{\"source\":{\"ip\":\"54.160.29.55\"}}}},\"event\":{\"category\":\"cisco-category\", \"type\":\"cisco-type\", \"data\":{\"User-Name\":\"mashhur\"}},\"client\":{\"ar_net\":\"54.160.29.55\", \"ongisac_ip\":\"54.160.29.55\", \"ip\":\"54.160.29.55\"}, \"destination\": {\"ar_net\":\"54.160.29.55\", \"ongisac_ip\":\"54.160.29.55\"}, \"source\": {\"ar_net\":\"54.160.29.55\", \"ongisac_ip\":\"54.160.29.55\"}, \"url\":{\"ongisac_domain\": \"ip-172-31-4-132\"}, \"DstIP\":\"54.160.29.55\",\"SrcIP\":\"54.160.29.55\",\"orginalClientSrcIP\":\"54.160.29.55\",\"ReferencedHost\":\"ip-172-31-4-132\", \"dns\":{\"question\": {\"ongisac_domain\":\"example.com/my-path?query=value\"}}}", "category" => "cisco-category", "type" => "cisco-type"}, "host" => "Miks-M4", "fileset" => {"name" => "asa", "module" => "system"}, "dns" => {"question" => {"ongisac_domain" => "example.com/my-path?query=value"}}, "client" => {"ongisac_ip" => "54.160.29.55", "ip" => "54.160.29.55", "ar_net" => "54.160.29.55"}, "DstIP" => "54.160.29.55", "orginalClientSrcIP" => "54.160.29.55", "url" => {"ongisac_domain" => "ip-172-31-4-132"}, "ReferencedHost" => "ip-172-31-4-132", "@version" => "1", "system" => {"auth" => {"ssh" => {"source" => {"ip" => "54.160.29.55"}}, "timestamp" => "May 17  05:17:00"}}, "destination" => {"ongisac_ip" => "54.160.29.55", "ar_net" => "54.160.29.55"}, "SrcIP" => "54.160.29.55", "@timestamp" => 2026-03-25T21:25:42.377766Z, "sequence" => 493, "source" => 
{"ongisac_ip" => "54.160.29.55", "ar_net" => "54.160.29.55"}}], response: {"update" => {"status" => 404, "error" => {"type" => "document_missing_exception", "reason" => "[nonexistent_id_12345]: document missing", "index_uuid" => "yF524HiDT6-OlpSP3VSlgQ", "shard" => "0", "index" => "test-dlq"}}}}}}}

begin
  @dlq_writer.write(event, "#{detailed_message}")
rescue => e
  @logger.error("Failed to write event to DLQ",
Member

One thing to note: this won't help when the DLQ is full. In that case the write fails silently.

It will, however, fix the case where the DLQ writer throws.

Contributor Author

After a number of tests I have reshaped the logic: catching the DLQ writer exception doesn't seem very valuable. Instead, displaying a sample event for each failure status code returned by ES helps us understand the situation better and judge whether the issue can be addressed immediately.

Member

Just as a follow-up - what happens when the dlq_writer throws here, does the bulk request get retried, or does it get thrown away? Is there a danger of data loss?

Contributor Author

BTW, I have reverted this logic, because logging the count of events sent to the DLQ together with sample events looks more valuable.
If your question is about how the plugin generally behaves when dlq_writer throws an exception: the events will be retried, because the action and status aren't changed (the retry decision is based on them) and the exception is caught in the retrying_submit method.
I tested this behaviour yesterday. This commit was intentional, to check for data loss if the plugin just swallows the exception:
bad8c0c
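
The retry behaviour described in this reply can be modelled with a toy example. Everything below is a simplified stand-in written for illustration: `FlakyDlqWriter`, this `submit`, this `retrying_submit`, and the `max_attempts` cap are assumptions and do not reproduce the plugin's real methods.

```ruby
# Toy model only; these classes and methods are simplified stand-ins,
# not the plugin's real implementations.
class FlakyDlqWriter
  attr_reader :written

  def initialize(failures)
    @failures = failures # number of writes that raise before succeeding
    @written = []
  end

  def write(event, reason)
    if @failures > 0
      @failures -= 1
      raise IOError, "simulated DLQ write failure"
    end
    @written << [event, reason]
  end
end

# Sends every pending event to the DLQ; an exception leaves `pending` untouched.
def submit(pending, dlq_writer)
  pending.each { |event| dlq_writer.write(event, "Could not index event to Elasticsearch.") }
  [] # nothing left to retry
end

# Keeps retrying the same batch until submit stops raising.
def retrying_submit(actions, dlq_writer, max_attempts: 5)
  pending = actions
  attempts = 0
  until pending.empty?
    attempts += 1
    raise "gave up after #{max_attempts} attempts" if attempts > max_attempts

    begin
      pending = submit(pending, dlq_writer)
    rescue IOError
      # The batch is unchanged, so the next loop iteration retries it.
    end
  end
  attempts
end

writer = FlakyDlqWriter.new(2)
attempts = retrying_submit(%w[event-1 event-2], writer)
```

In this model the events are not lost when the writer throws: the batch stays pending and is submitted again until the write succeeds.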

@mashhurs mashhurs marked this pull request as ready for review March 25, 2026 21:39
@@ -296,6 +297,11 @@ def submit(actions)
elsif @dlq_codes.include?(status)
handle_dlq_response("Could not index event to Elasticsearch.", action, status, response)
Member

WDYT about DLQ'd events going to the debug log?

Contributor Author

Good question! Yesterday it took me ~15 min to decide to go with warning. Why? I consider that this case needs ATTENTION. But I am open to changing it if you have a strong argument.

Member

Sorry if there was a misunderstanding. My comment was about whether we should log the reason for every event going to the DLQ at debug level, not about the "summary" message, for which I think warn is appropriate, although I could be persuaded that it should be info level.
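
The split suggested here (one summary message at warn, per-event reasons at debug) could look roughly like this. It is only a sketch using Ruby's standard `Logger` for self-containment; the plugin logs through Logstash's own logger, and the method name `log_dlq_routing` and the message texts are assumptions.

```ruby
require "logger"
require "stringio"

# Hypothetical sketch: one warn-level summary per bulk request,
# per-event reasons only at debug level.
def log_dlq_routing(logger, failures)
  logger.warn("Events could not be indexed and are being routed to DLQ: count=#{failures.size}")
  failures.each do |f|
    logger.debug("DLQ'd event: status=#{f[:status]} reason=#{f[:reason]}")
  end
end

io = StringIO.new
logger = Logger.new(io)
logger.level = Logger::WARN # debug lines are suppressed at this level

log_dlq_routing(logger, [
  { status: 404, reason: "document missing" },
  { status: 400, reason: "mapper_parsing_exception" }
])
output = io.string
```

With the level set to warn, only the summary line is emitted; lowering the level to debug would add one line per DLQ'd event.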

@mashhurs mashhurs self-assigned this Apr 10, 2026
@mashhurs mashhurs linked an issue Apr 10, 2026 that may be closed by this pull request
@mashhurs mashhurs requested a review from andsel April 13, 2026 22:46
Contributor Author

@andsel (sorry for assigning this to you), I would like to get your opinion on this as well, since we are touching the DLQ domain together, specifically where es-output is involved. Please read the PR description and let me know whether this makes sense or can be improved.

Contributor

@andsel andsel left a comment

Left a comment about what I think is a typo in the variable name.

Overall the idea clicks. Let's check whether that warn log line becomes too noisy for users: if there is a misconfiguration in the index, they could be flooded by logs.

Comment thread lib/logstash/plugin_mixins/elasticsearch/common.rb Outdated
Contributor Author

Left a comment about what I think is a typo in the variable name.

Overall the idea clicks. Let's check whether that warn log line becomes too noisy for users: if there is a misconfiguration in the index, they could be flooded by logs.

The log noise would be similar to what we get when the DLQ isn't configured for the pipeline.

@mashhurs mashhurs requested a review from andsel April 14, 2026 21:54
Contributor

andsel commented Apr 15, 2026

I see. I was just wondering whether we can imagine a better way to log errors like that, when the same logs repeat in short bursts. But that's out of scope for this PR.

Contributor

@andsel andsel left a comment

LGTM if @robbavey doesn't spot any other issue.

Development

Successfully merging this pull request may close these issues.

A visibility about events which are going into DLQ.

3 participants