Core: Add streaming CloseableIterable accessors to SnapshotChanges#16390
Open
wombatu-kun wants to merge 1 commit into
Open
Core: Add streaming CloseableIterable accessors to SnapshotChanges#16390wombatu-kun wants to merge 1 commit into
wombatu-kun wants to merge 1 commit into
Conversation
SnapshotChanges previously exposed only cached accessors that eagerly materialize all file changes into in-memory lists and return Iterable. Some callers (for example replaced-partition validation, see apache#13556) need to stream changes without loading every file into memory. This adds streaming CloseableIterable accessors and re-implements the cached accessors as thin wrappers over them, as suggested in apache#15659. Changes: - Add addedDataFilesIterable(), removedDataFilesIterable(), addedDeleteFilesIterable() and removedDeleteFilesIterable(), which return lazily-evaluated CloseableIterables that the caller must close and that are not cached. - Re-implement the cached accessors (addedDataFiles(), removedDataFiles(), addedDeleteFiles(), removedDeleteFiles()) as materialize() wrappers over the streaming methods, preserving their existing caching identity contract. - Replace the per-type cache/read methods with a single generic manifest-reading pipeline. Manifests are still read single-threaded by default and in parallel with a bounded queue when an executor is configured via Builder.executeWith. Reading both added and removed changes through the cached accessors now performs two manifest passes instead of one. This keeps the change minimal and additive; a single-pass optimization can be added later if it proves necessary. Adds comprehensive tests in TestSnapshotChanges covering streaming added and removed data and delete files, equivalence with the cached results, non-caching semantics, statistics retention versus stripping, EXISTING-entry exclusion, snapshot-id manifest filtering, and the parallel execution path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #15659
What
SnapshotChangespreviously exposed only cached accessors that eagerly materialize all file changes into in-memory lists and returnIterable. Some callers (for example replaced-partition validation, see #13556) need to stream changes without loading every file into memory. This PR adds streamingCloseableIterableaccessors and re-implements the cached accessors as thin wrappers over them, exactly as suggested in #15659.Changes
addedDataFilesIterable(),removedDataFilesIterable(),addedDeleteFilesIterable()andremovedDeleteFilesIterable(), which return lazily-evaluatedCloseableIterables that the caller must close and that are not cached.addedDataFiles(),removedDataFiles(),addedDeleteFiles(),removedDeleteFiles()) asmaterialize()wrappers over the streaming methods, preserving their existing caching identity contract.Builder.executeWith.Trade-off
Reading both added and removed changes through the cached accessors now performs two manifest passes instead of one. This keeps the change minimal and strictly additive; a single-pass optimization can be added in a follow-up if it proves necessary.
Testing
Adds 10 focused tests in
TestSnapshotChanges(the 3 existing tests are unchanged), each guarding a distinct code path: streaming added/removed data and delete files, equivalence with the cached results, non-caching semantics, statistics retention versus stripping (copy()vscopyWithoutStats()), EXISTING-entry exclusion, snapshot-id manifest filtering, and the parallel execution path.:iceberg-core:test,spotlessCheckandrevapiall pass; the change is purely additive so no revapi exception is required.🤖 Generated with Claude Code