Optimize DLQ segment directory scans with single-pass logic.#18970
Optimize DLQ segment directory scans with single-pass logic.#18970mashhurs merged 7 commits intoelastic:mainfrom
Conversation
🤖 GitHub commentsJust comment with:
|
There was a problem hiding this comment.
Pull request overview
This PR optimizes Dead Letter Queue (DLQ) segment file lookups by avoiding full directory materialization + sorting when only the min/max segment is needed, using a single-pass DirectoryStream scan with OS-level glob filtering.
Changes:
- Replace multi-step segment index discovery with
DeadLetterQueueUtils.maxSegmentId(...). - Replace sorted segment-path lookups with
DeadLetterQueueUtils.oldestSegmentPath(...)for selecting the oldest segment (with optional size filtering). - Remove now-unused sorted-list helper and adjust callers to use the updated utilities.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueWriter.java |
Switches writer initialization and oldest-segment selection to new single-pass utility methods. |
logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java |
Adds single-pass maxSegmentId/oldestSegmentPath implementations using DirectoryStream globbing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… lookups Replace listSegmentPathsSortedBySegmentId (which materialized all paths, sorted O(N log N), then took the first element) with purpose-built maxSegmentId and oldestSegmentPath utilities that use Files.newDirectoryStream with OS-level glob filtering and a single O(N) pass. Also narrow listFiles to compare only the filename component instead of the full path, and consolidate duplicate segment ID parsing in DeadLetterQueueWriter to reuse extractSegmentId.
661bf42 to
d1960dc
Compare
…ueueUtils.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| } | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
moved to dedicated DeadLetterQueueUtilsTest.java space
|
Hi @mashhurs which is the baseline that measure the existing implementation? |
andsel
left a comment
There was a problem hiding this comment.
I think the idea good to avoid the sorting to find min and max. I don't know if the big differentiator is the usage of DirectoryStream over the listing of files. However, I'm in favor of using it.
Left a question in a separate comment to understand which is the baseline in the performance analysis you have done.
I've suggested the usage of a filter interface and asked some clarification on a javadoc comment.
Refine the code comment. Co-authored-by: Mashhur <99575341+mashhurs@users.noreply.github.com>
The benchmarks are on the logics "before this PR" and "with this PR". I have placed them in my separate remote repo branch (also added in this PR description) - |
…nstead of rescanning filesystem
andsel
left a comment
There was a problem hiding this comment.
LGTM
Left a suggestion on the Javadoc and thanks for checking my suggestion about DirectoryStream's filtering.
I mean that to have a comparison of performance, we need a clear definition of which is the baseline before the changes. In the table presented in the "Why is it important/What is the impact to the user?" we have 3 columns:
There is no clear indication of the original baseline, I suppose it's "UsingStream" but given that in the description it's also cited DirectoryStream it's not clear to which stream it refers. |
…ueueUtils.java Apply Java doc suggestion, provides clearer signal. Co-authored-by: Andrea Selva <selva.andre@gmail.com>
Ah I thought, I added to the PR description 🤦 , just added sorry for that. |
💛 Build succeeded, but was flaky
Failed CI StepsHistory
cc @mashhurs |
* Optimize DLQ segment directory scans with single-pass DirectoryStream lookups Before this change, listing segment files and finding max segment ID logic was using plain Java stream to list all files, then filter by size and sort. With this PR change, we optimize DLQ segment file lookups to use single-pass directory scans. Use DirectoryStream with OS-level glob instead of listing all files, find the min or max segment. There are use-cases which require size > 0 when updating oldest file segment and no size check when removing oldest segment file which will be handled in a single logic. * Move file size condition after the extract segment ID. * Add unit tests * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestions from code review Refine the code comment. Co-authored-by: Mashhur <99575341+mashhurs@users.noreply.github.com> * When removing the segment, track DLQ currentQueueSize incrementally instead of rescanning filesystem * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Apply Java doc suggestion, provides clearer signal. Co-authored-by: Andrea Selva <selva.andre@gmail.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andrea Selva <selva.andre@gmail.com> (cherry picked from commit 894ca21)
* Optimize DLQ segment directory scans with single-pass DirectoryStream lookups Before this change, listing segment files and finding max segment ID logic was using plain Java stream to list all files, then filter by size and sort. With this PR change, we optimize DLQ segment file lookups to use single-pass directory scans. Use DirectoryStream with OS-level glob instead of listing all files, find the min or max segment. There are use-cases which require size > 0 when updating oldest file segment and no size check when removing oldest segment file which will be handled in a single logic. * Move file size condition after the extract segment ID. * Add unit tests * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestions from code review Refine the code comment. Co-authored-by: Mashhur <99575341+mashhurs@users.noreply.github.com> * When removing the segment, track DLQ currentQueueSize incrementally instead of rescanning filesystem * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Apply Java doc suggestion, provides clearer signal. Co-authored-by: Andrea Selva <selva.andre@gmail.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andrea Selva <selva.andre@gmail.com> (cherry picked from commit 894ca21)
|
@Mergifyio backport 9.4 |
✅ Backports have been createdDetails
|
* Optimize DLQ segment directory scans with single-pass DirectoryStream lookups Before this change, listing segment files and finding max segment ID logic was using plain Java stream to list all files, then filter by size and sort. With this PR change, we optimize DLQ segment file lookups to use single-pass directory scans. Use DirectoryStream with OS-level glob instead of listing all files, find the min or max segment. There are use-cases which require size > 0 when updating oldest file segment and no size check when removing oldest segment file which will be handled in a single logic. * Move file size condition after the extract segment ID. * Add unit tests * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply suggestions from code review Refine the code comment. Co-authored-by: Mashhur <99575341+mashhurs@users.noreply.github.com> * When removing the segment, track DLQ currentQueueSize incrementally instead of rescanning filesystem * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Apply Java doc suggestion, provides clearer signal. Co-authored-by: Andrea Selva <selva.andre@gmail.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andrea Selva <selva.andre@gmail.com> (cherry picked from commit 894ca21)
…#19012) * Optimize DLQ segment directory scans with single-pass DirectoryStream lookups Before this change, listing segment files and finding max segment ID logic was using plain Java stream to list all files, then filter by size and sort. With this PR change, we optimize DLQ segment file lookups to use single-pass directory scans. Use DirectoryStream with OS-level glob instead of listing all files, find the min or max segment. There are use-cases which require size > 0 when updating oldest file segment and no size check when removing oldest segment file which will be handled in a single logic. * Move file size condition after the extract segment ID. * Add unit tests * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java * Apply suggestions from code review Refine the code comment. * When removing the segment, track DLQ currentQueueSize incrementally instead of rescanning filesystem * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Apply Java doc suggestion, provides clearer signal. --------- (cherry picked from commit 894ca21) Co-authored-by: Mashhur <99575341+mashhurs@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andrea Selva <selva.andre@gmail.com>
…#19011) * Optimize DLQ segment directory scans with single-pass DirectoryStream lookups Before this change, listing segment files and finding max segment ID logic was using plain Java stream to list all files, then filter by size and sort. With this PR change, we optimize DLQ segment file lookups to use single-pass directory scans. Use DirectoryStream with OS-level glob instead of listing all files, find the min or max segment. There are use-cases which require size > 0 when updating oldest file segment and no size check when removing oldest segment file which will be handled in a single logic. * Move file size condition after the extract segment ID. * Add unit tests * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java * Apply suggestions from code review Refine the code comment. * When removing the segment, track DLQ currentQueueSize incrementally instead of rescanning filesystem * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Apply Java doc suggestion, provides clearer signal. --------- (cherry picked from commit 894ca21) Co-authored-by: Mashhur <99575341+mashhurs@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andrea Selva <selva.andre@gmail.com>
…#19013) * Optimize DLQ segment directory scans with single-pass DirectoryStream lookups Before this change, listing segment files and finding max segment ID logic was using plain Java stream to list all files, then filter by size and sort. With this PR change, we optimize DLQ segment file lookups to use single-pass directory scans. Use DirectoryStream with OS-level glob instead of listing all files, find the min or max segment. There are use-cases which require size > 0 when updating oldest file segment and no size check when removing oldest segment file which will be handled in a single logic. * Move file size condition after the extract segment ID. * Add unit tests * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java * Apply suggestions from code review Refine the code comment. * When removing the segment, track DLQ currentQueueSize incrementally instead of rescanning filesystem * Update logstash-core/src/main/java/org/logstash/common/io/DeadLetterQueueUtils.java Apply Java doc suggestion, provides clearer signal. --------- (cherry picked from commit 894ca21) Co-authored-by: Mashhur <99575341+mashhurs@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andrea Selva <selva.andre@gmail.com>
Release notes
Performance improvements which saves ~40% CPU resource on DLQ segment file lookup operations.
What does this PR do?
~40% improvement on DLQ segment logics.
Before this change, listing segment files and finding max segment ID logic was using plain Java stream (
UsingStreambenchmark) to list all files, then filter by size and sort.With this PR change, we optimize DLQ segment file lookups to use single-pass directory scans.
Use DirectoryStream with OS-level glob instead of listing all files, find the min or max segment. There are use-cases which require size > 0 (
WithMinSizein benchmarks) when updating oldest file segment and no size check when removing oldest segment file (NoMinSizein benchmarks) which will be handled in a single logic.Why is it important/What is the impact to the user?
Improves the performances of the LS pipelines heavily using DLQs.
Old logic and Benchmarks can be seen here - https://github.com/elastic/logstash/compare/main...mashhurs:logstash:dlq-benchmark-test?expand=1
Raw JMH data:
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files (and/or docker env variables)Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs