feat(parquet_writer): truncate manifest bounds for STRING/BINARY/FIXED to match Java#2544
Open
SreeramGarlapati wants to merge 4 commits into
Open
feat(parquet_writer): truncate manifest bounds for STRING/BINARY/FIXED to match Java#2544SreeramGarlapati wants to merge 4 commits into
SreeramGarlapati wants to merge 4 commits into
Conversation
…tch Java Apply Iceberg's default 16-unit bound truncation to manifest lower/upper bounds for STRING (codepoint-based), BINARY, and FIXED (byte-based) when collecting per-row-group statistics in `MinMaxColAggregator`. This mirrors Java's `org.apache.iceberg.util.UnicodeUtil#truncateStringMin/Max` and `BinaryUtil#truncateBinaryMin/Max`, called from `ParquetUtil#updateMin/Max`. Without this, long values produced manifest bounds that exceeded the conventional 16-unit budget and didn't agree with bounds Spark/Java would have written for the same data. Upper-bound truncation: take the 16-unit prefix, then increment the last unit; on overflow drop that position and try the previous one. If every position in the prefix is at max, we cannot produce a sound upper bound and drop it (matches Java semantics; lower bound is still recorded). For STRING upper bounds we walk past UTF-16 surrogates (U+D800-U+DFFF) when incrementing because Rust's `char::from_u32` rejects them; Java's `Character.isValidCodePoint` accepts surrogates, but skipping them in Rust preserves monotonic ordering for valid UTF-8. Tests added (18): - 13 helper unit tests covering short input, long input, overflow drop, all-max fallback, and the UTF-16 surrogate skip - 4 aggregator tests for STRING/BINARY truncation behavior and the drop-only-upper case - 1 end-to-end tokio test that writes long-string rows through ParquetWriter and asserts the resulting `data_file.lower_bounds()` / `upper_bounds()`
…ed chunk Round 1 review fixes for the manifest-bound truncation in MinMaxColAggregator: - truncate_lower_bound / truncate_upper_bound for Fixed(N) now keep the column's declared PrimitiveType::Fixed(N) instead of re-typing as Fixed(<truncated_len>) via Datum::fixed. Use Datum::new(ty, Binary(bytes)) so downstream code that introspects datum.data_type() continues to see the schema's declared length. - MinMaxColAggregator now tracks an upper_unbounded set. When truncate_upper_bound returns None for any row group's max, the column's partial upper bound is dropped and further updates are blocked. Without this, an earlier row group's small upper bound could be left in place while a later row group's true max strictly exceeds it, producing a manifest upper_bound < true_max and breaking scan-time pruning. - Doc-comment on update() corrected: Java's ParquetUtil#updateMin/updateMax does not consult isMinExact/isMaxExact; we mirror that. - Doc on truncate_string_max calls out the Java-equivalent UTF-16 surrogate jump from U+D7FF to U+E000 (relevant for apache#2486). - Note on truncate-then-compare equivalence with Java's compare-then-truncate added inline. - Tests: extracted single_primitive_field_schema helper; added test_min_max_aggregator_merges_truncated_strings_across_row_groups, test_min_max_aggregator_drops_upper_after_unbounded_row_group, test_truncate_lower_upper_bound_fixed_preserves_declared_type, test_min_max_aggregator_truncates_long_fixed_bounds.
…datum After truncation, a Fixed(N) Datum carries fewer than N bytes in its literal even though the declared type says N. Add a regression test that exercises the two paths downstream consumers actually use: - `Datum::to_bytes()` — manifest single-value serialization writes the literal bytes verbatim regardless of declared Fixed length. - `PartialOrd` — wildcards on Fixed length and compares lex on raw bytes, so two truncated Fixed datums (16 bytes typed Fixed(20)) order correctly relative to each other. This locks the contract that future changes to Datum / PrimitiveLiteral serialization must preserve.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
What changes are included in this PR?
When
MinMaxColAggregatorcollects per-row-group parquet statistics incrates/iceberg/src/writer/file_writer/parquet_writer.rs, the resultinglower_bounds/upper_boundsare now truncated to match Java'sorg.apache.iceberg.util.UnicodeUtil(codepoint-based, forSTRING) andBinaryUtil(byte-based, forBINARYandFIXED), at Iceberg's default 16-unit length.Without this, long string/binary values produced manifest bounds that exceeded the conventional 16-unit budget and didn't agree with bounds Java/Spark would have written for the same data. In a two-writer setup where Java/Spark performs DDL/compaction on tables that iceberg-rust appends to, bounds disagreement breaks scan-time min/max pruning correctness for downstream readers.
The change has two pieces:
truncate_string_min/max(codepoint, mirrorsUnicodeUtil) andtruncate_binary_min/max(bytes, mirrorsBinaryUtil) inparquet_writer.rs.truncate_lower_bound/truncate_upper_bounddispatchers over(PrimitiveType, Datum)for the only three types that need truncation (String,Binary,Fixed(N)); other primitives pass through unchanged.Fixed(N), the truncatedDatumkeeps the column's declaredPrimitiveType::Fixed(N)(usingDatum::new) so downstream code that introspectsdatum.data_type()keeps seeing the schema's length.char::from_u32rejects them; Java'sincrementCodePointperforms the sameU+D7FF -> U+E000jump, so the produced bound matches Java for any valid&str.MinMaxColAggregator::updateno longer drops Parquet stats whosemin_is_exact/max_is_exactisfalse. Java'sParquetUtil#updateMin/updateMaxdoes not consult those flags — it always feeds the parquet-reported value throughBinaryUtil/UnicodeUtiltruncation. We mirror that by usingmin_bytes_opt/max_bytes_optto detect presence; a parquet-prefix-truncated min is still<=every value, and a parquet-truncated max is still>=every value, so the secondary 16-unit Iceberg truncation is sound. Without this, long-string columns whose Parquet writer already truncated stats had no manifest bounds at all. The regression test istest_min_max_aggregator_keeps_inexact_string_stats.Multi-row-group correctness: when one row group's max is unboundable (truncate-and-increment returns
None— e.g. allchar::MAX/ all0xFF), the aggregator drops the column's upper bound entirely and prevents future updates from re-adding it. Without this, an earlier row group's small upper bound could be left in place while a later row group's true max strictly exceeds it, producing a manifestupper_bound < true_max.API surface: none. All new items are private to the file. No public API changes.
Out of scope (intentionally deferred):
MetricsConfigplumbing — per-column truncate-length, full-column-disable, count-only modes (covered by feat(writer): honor write.metadata.metrics.* properties — bound truncation missing in MinMaxColAggregator #2362).partition_value_from_bounds(currently#[allow(dead_code)]and uncalled in-tree) — thelower == uppercheck there would need to compare against untruncated bounds; that work belongs with the consumer ofpartition_value_from_bounds.Are these changes tested?
Yes.
cargo test -p iceberg --lib→ 1318 passed / 0 failed.cargo clippy -p iceberg --lib --testsclean.cargo fmtapplied.22 new tests under
crates/iceberg/src/writer/file_writer/parquet_writer.rs::tests:Truncation helpers (13):
test_truncate_string_min_short_input_unchangedtest_truncate_string_min_long_input_truncates_codepointstest_truncate_string_max_short_input_unchangedtest_truncate_string_max_long_input_increments_last_codepointtest_truncate_string_max_overflow_drops_positiontest_truncate_string_max_skips_utf16_surrogatestest_truncate_string_max_all_max_returns_nonetest_truncate_binary_min_short_input_unchangedtest_truncate_binary_min_long_input_truncatestest_truncate_binary_max_short_input_unchangedtest_truncate_binary_max_long_input_increments_last_bytetest_truncate_binary_max_drops_trailing_0xfftest_truncate_binary_max_all_ff_returns_noneAggregator (8):
test_min_max_aggregator_keeps_inexact_string_stats(regression for the bundled fix)test_min_max_aggregator_truncates_long_string_boundstest_min_max_aggregator_truncates_long_binary_boundstest_min_max_aggregator_truncates_long_fixed_boundstest_min_max_aggregator_drops_only_upper_when_unboundabletest_min_max_aggregator_merges_truncated_strings_across_row_groupstest_min_max_aggregator_drops_upper_after_unbounded_row_grouptest_truncate_lower_upper_bound_fixed_preserves_declared_typeEnd-to-end (1):
test_parquet_writer_truncates_long_string_bounds— writes long-string rows throughParquetWriterand assertsdata_file.lower_bounds()/upper_bounds()match the Java-equivalent 16-codepoint truncation.