Improves a visibility the case events routed to DLQ.#1253
Improves a visibility the case events routed to DLQ.#1253mashhurs wants to merge 6 commits intologstash-plugins:mainfrom
Conversation
| begin | ||
| @dlq_writer.write(event, "#{detailed_message}") | ||
| rescue => e | ||
| @logger.error("Failed to write event to DLQ", |
There was a problem hiding this comment.
Something to note, this won't work when the DLQ is full. The write fails silently.
But it will fix the issue where the DLQ writer throws
There was a problem hiding this comment.
I have re-shaped the logic after number of testing that this catching DLQ writer exception change doesn't seem too valuable. Rather, displaying sample events for each failed code by ES gives us an opportunity to understand the situation better and if the issue can be immediately addressed.
There was a problem hiding this comment.
Just as a follow-up - what happens when the dlq_writer throws here, does the bulk request get retried, or does it get thrown away? Is there a danger of data loss?
There was a problem hiding this comment.
BTW, I have reverted this logic as I do see catching up logging event count sent to DLQ with sample events looks more valuable.
If your question is about generally how plugin behaves when dlq_writer throws an exception, events will be retried because action, status aren't changed (retry will be decided based on them) and exception will be caught in retrying_submit method.
I tested this behaviour yesterday, this commit was intentional to check data loss if plugin just swallows the exception -
bad8c0c
…ror happens during the DLQ persist, shows the status, action and response for better visibility.
e080b19 to
bad8c0c
Compare
…d just before handle_dlq_response
| @@ -296,6 +297,11 @@ def submit(actions) | |||
| elsif @dlq_codes.include?(status) | |||
| handle_dlq_response("Could not index event to Elasticsearch.", action, status, response) | |||
There was a problem hiding this comment.
WDYT about DLQ'd events going to the debug log?
There was a problem hiding this comment.
kk good question! Yesterday it took me ~15min to decide to go with warning. Why?!- I consider this case need an ATTENTION. But I am open to change if you have strong point.
There was a problem hiding this comment.
Sorry if there is a misunderstanding - my comment was whether we should consider the reason for all events going to the DLQ at a debug level, not the "summary" message, which I think a warn is appropriate, although I could be persuaded that it is an info level
|
@andsel (sorry for assigning this), I would like to get your opinion on this as well as we are touching DLQ domain together, specifically with this when es-output involves. Please read the PR description and do let me know if this makes sense or can improve more. |
andsel
left a comment
There was a problem hiding this comment.
Left a comment for typo in the variable name, I think.
Globally the idea click, let's check if that warn log line becomes to noisy to users. If there is a misconfiguration in the index, they could flooded by logs.
"how noisy logs" is a similar when we don't configure DLQ for the pipeline. |
|
I see, was jut questioning if we can imagine a better way to log errors like that, when we have repeating same logs in short burst. But it's out of context of this PR. |
Touches two improvements:
With this PR, I have added very minimal logic
from the failed bulk request. Note that this may still produce alot noise if there will be alot 400/404 rejected events for each bulk request.
Sample log from tests: