feat(parquet_writer): truncate manifest bounds for STRING/BINARY/FIXED to match Java by SreeramGarlapati · Pull Request #2544 · apache/iceberg-rust

SreeramGarlapati · 2026-05-31T03:27:56Z

Which issue does this PR close?

Refs feat(writer): honor write.metadata.metrics.* properties — bound truncation missing in MinMaxColAggregator #2362
Refs discussion: bucket[N] / truncate[N] upper bound diverges from Java reference impl #2486

What changes are included in this PR?

When MinMaxColAggregator collects per-row-group parquet statistics in crates/iceberg/src/writer/file_writer/parquet_writer.rs, the resulting lower_bounds / upper_bounds are now truncated to match Java's org.apache.iceberg.util.UnicodeUtil (codepoint-based, for STRING) and BinaryUtil (byte-based, for BINARY and FIXED), at Iceberg's default 16-unit length.

Without this, long string/binary values produced manifest bounds that exceeded the conventional 16-unit budget and didn't agree with bounds Java/Spark would have written for the same data. In a two-writer setup where Java/Spark performs DDL/compaction on tables that iceberg-rust appends to, bounds disagreement breaks scan-time min/max pruning correctness for downstream readers.

The change has two pieces:

Iceberg-default truncation (16 unit) applied at the manifest layer:
- new private helpers truncate_string_min/max (codepoint, mirrors UnicodeUtil) and truncate_binary_min/max (bytes, mirrors BinaryUtil) in parquet_writer.rs.
- truncate_lower_bound / truncate_upper_bound dispatchers over (PrimitiveType, Datum) for the only three types that need truncation (String, Binary, Fixed(N)); other primitives pass through unchanged.
- For Fixed(N), the truncated Datum keeps the column's declared PrimitiveType::Fixed(N) (using Datum::new) so downstream code that introspects datum.data_type() keeps seeing the schema's length.
- Upper-bound increment walks past UTF-16 surrogates (U+D800–U+DFFF) because Rust's char::from_u32 rejects them; Java's incrementCodePoint performs the same U+D7FF -> U+E000 jump, so the produced bound matches Java for any valid &str.
Bundled fix to inexact-stats handling: MinMaxColAggregator::update no longer drops Parquet stats whose min_is_exact/max_is_exact is false. Java's ParquetUtil#updateMin/updateMax does not consult those flags — it always feeds the parquet-reported value through BinaryUtil/UnicodeUtil truncation. We mirror that by using min_bytes_opt/max_bytes_opt to detect presence; a parquet-prefix-truncated min is still <= every value, and a parquet-truncated max is still >= every value, so the secondary 16-unit Iceberg truncation is sound. Without this, long-string columns whose Parquet writer already truncated stats had no manifest bounds at all. The regression test is test_min_max_aggregator_keeps_inexact_string_stats.
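The codepoint-based STRING helpers listed above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function names follow the description, but the signatures, the `TRUNCATE_LEN` constant, and the `increment_char` helper are assumptions made for the example.

```rust
/// Iceberg's default truncation length, counted in code points for STRING.
/// (Constant name is an assumption for this sketch.)
const TRUNCATE_LEN: usize = 16;

/// Lower bound: keep at most `len` code points.
/// A prefix always sorts <= the original value, so it is a sound minimum.
fn truncate_string_min(s: &str, len: usize) -> String {
    s.chars().take(len).collect()
}

/// Next valid Unicode scalar value after `c`, or `None` if `c` is `char::MAX`.
/// `char::from_u32` rejects U+D800..=U+DFFF, so jump from U+D7FF to U+E000.
fn increment_char(c: char) -> Option<char> {
    match c as u32 {
        0xD7FF => Some('\u{E000}'),
        n => char::from_u32(n + 1), // yields None past U+10FFFF
    }
}

/// Upper bound: keep `len` code points, then increment the last one.
/// If a position cannot be incremented, drop it and try the previous one.
/// Returns `None` when every position in the prefix is already `char::MAX`.
fn truncate_string_max(s: &str, len: usize) -> Option<String> {
    // Nothing is cut off: the value is its own upper bound.
    if s.chars().count() <= len {
        return Some(s.to_string());
    }
    let mut prefix: Vec<char> = s.chars().take(len).collect();
    while let Some(last) = prefix.pop() {
        if let Some(next) = increment_char(last) {
            prefix.push(next);
            return Some(prefix.into_iter().collect());
        }
    }
    None
}
```

In this sketch, a 26-letter alphabet string truncated at 16 code points gets the lower bound `abcdefghijklmnop` and the upper bound `abcdefghijklmnoq`, while an input whose prefix is all `char::MAX` has no upper bound.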

Multi-row-group correctness: when one row group's max is unboundable (truncate-and-increment returns None — e.g. all char::MAX / all 0xFF), the aggregator drops the column's upper bound entirely and prevents future updates from re-adding it. Without this, an earlier row group's small upper bound could be left in place while a later row group's true max strictly exceeds it, producing a manifest upper_bound < true_max.
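The multi-row-group rule can be modelled with a small tracker. This is a simplified sketch under stated assumptions: `UpperBoundTracker`, `update_max`, and `upper_bound` are hypothetical names, bounds are plain byte vectors keyed by field id, and the real `MinMaxColAggregator` works on `Datum` values instead.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the upper-bound half of the aggregator.
#[derive(Default)]
struct UpperBoundTracker {
    /// Current upper bound per field id.
    upper: HashMap<i32, Vec<u8>>,
    /// Fields for which some row group had no representable upper bound.
    upper_unbounded: HashSet<i32>,
}

impl UpperBoundTracker {
    /// `truncated_max` is one row group's max after truncate-and-increment;
    /// `None` means no sound upper bound exists for that row group.
    fn update_max(&mut self, field_id: i32, truncated_max: Option<Vec<u8>>) {
        // Once a field is unbounded, later row groups must not re-add a bound.
        if self.upper_unbounded.contains(&field_id) {
            return;
        }
        match truncated_max {
            None => {
                // Drop any partial bound from earlier row groups and block updates.
                self.upper.remove(&field_id);
                self.upper_unbounded.insert(field_id);
            }
            Some(candidate) => {
                // Keep the larger bound; Vec<u8> compares lexicographically.
                let replace = self
                    .upper
                    .get(&field_id)
                    .map_or(true, |current| candidate > *current);
                if replace {
                    self.upper.insert(field_id, candidate);
                }
            }
        }
    }

    fn upper_bound(&self, field_id: i32) -> Option<&[u8]> {
        self.upper.get(&field_id).map(|v| v.as_slice())
    }
}
```

Feeding the tracker `Some(b"abc")`, then `None`, then `Some(b"zzz")` for the same field leaves that field with no upper bound, which is the behaviour the paragraph above describes.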

API surface: none. All new items are private to the file. No public API changes.

Out of scope (intentionally deferred):

Full MetricsConfig plumbing — per-column truncate-length, full-column-disable, count-only modes (covered by feat(writer): honor write.metadata.metrics.* properties — bound truncation missing in MinMaxColAggregator #2362).
Wiring the new bound length to partition_value_from_bounds (currently #[allow(dead_code)] and uncalled in-tree) — the lower == upper check there would need to compare against untruncated bounds; that work belongs with the consumer of partition_value_from_bounds.

Are these changes tested?

Yes. cargo test -p iceberg --lib → 1318 passed / 0 failed. cargo clippy -p iceberg --lib --tests clean. cargo fmt applied.

22 new tests under crates/iceberg/src/writer/file_writer/parquet_writer.rs::tests:

Truncation helpers (13):

test_truncate_string_min_short_input_unchanged
test_truncate_string_min_long_input_truncates_codepoints
test_truncate_string_max_short_input_unchanged
test_truncate_string_max_long_input_increments_last_codepoint
test_truncate_string_max_overflow_drops_position
test_truncate_string_max_skips_utf16_surrogates
test_truncate_string_max_all_max_returns_none
test_truncate_binary_min_short_input_unchanged
test_truncate_binary_min_long_input_truncates
test_truncate_binary_max_short_input_unchanged
test_truncate_binary_max_long_input_increments_last_byte
test_truncate_binary_max_drops_trailing_0xff
test_truncate_binary_max_all_ff_returns_none

Aggregator (8):

test_min_max_aggregator_keeps_inexact_string_stats (regression for the bundled fix)
test_min_max_aggregator_truncates_long_string_bounds
test_min_max_aggregator_truncates_long_binary_bounds
test_min_max_aggregator_truncates_long_fixed_bounds
test_min_max_aggregator_drops_only_upper_when_unboundable
test_min_max_aggregator_merges_truncated_strings_across_row_groups
test_min_max_aggregator_drops_upper_after_unbounded_row_group
test_truncate_lower_upper_bound_fixed_preserves_declared_type

End-to-end (1):

test_parquet_writer_truncates_long_string_bounds — writes long-string rows through ParquetWriter and asserts data_file.lower_bounds() / upper_bounds() match the Java-equivalent 16-codepoint truncation.

…tch Java

Apply Iceberg's default 16-unit bound truncation to manifest lower/upper bounds for STRING (codepoint-based), BINARY, and FIXED (byte-based) when collecting per-row-group statistics in `MinMaxColAggregator`. This mirrors Java's `org.apache.iceberg.util.UnicodeUtil#truncateStringMin/Max` and `BinaryUtil#truncateBinaryMin/Max`, called from `ParquetUtil#updateMin/Max`.

Without this, long values produced manifest bounds that exceeded the conventional 16-unit budget and didn't agree with bounds Spark/Java would have written for the same data.

Upper-bound truncation: take the 16-unit prefix, then increment the last unit; on overflow drop that position and try the previous one. If every position in the prefix is at max, we cannot produce a sound upper bound and drop it (matches Java semantics; lower bound is still recorded).

For STRING upper bounds we walk past UTF-16 surrogates (U+D800-U+DFFF) when incrementing because Rust's `char::from_u32` rejects them; Java's `Character.isValidCodePoint` accepts surrogates, but skipping them in Rust preserves monotonic ordering for valid UTF-8.

Tests added (18):
- 13 helper unit tests covering short input, long input, overflow drop, all-max fallback, and the UTF-16 surrogate skip
- 4 aggregator tests for STRING/BINARY truncation behavior and the drop-only-upper case
- 1 end-to-end tokio test that writes long-string rows through ParquetWriter and asserts the resulting `data_file.lower_bounds()` / `upper_bounds()`
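For BINARY and FIXED, the upper-bound procedure in this commit message (prefix, increment the last unit, drop the position on overflow) might look like the sketch below. The function names echo the PR description, but the signatures are assumptions and this is not the PR's code.

```rust
/// Lower bound: keep at most `len` bytes; a prefix sorts <= the original.
fn truncate_binary_min(bytes: &[u8], len: usize) -> Vec<u8> {
    bytes[..bytes.len().min(len)].to_vec()
}

/// Upper bound: keep `len` bytes, then increment the last byte.
/// A trailing 0xFF cannot be incremented, so it is dropped and the
/// previous byte is tried. Returns `None` if the whole prefix is 0xFF.
fn truncate_binary_max(bytes: &[u8], len: usize) -> Option<Vec<u8>> {
    // Nothing is cut off: the value is its own upper bound.
    if bytes.len() <= len {
        return Some(bytes.to_vec());
    }
    let mut prefix = bytes[..len].to_vec();
    while let Some(last) = prefix.pop() {
        // checked_add returns None exactly when `last` is 0xFF.
        if let Some(next) = last.checked_add(1) {
            prefix.push(next);
            return Some(prefix);
        }
    }
    None
}
```

With a length of 2, the input `[1, 2, 3, 4]` yields `[1, 3]`, `[1, 0xFF, 3]` yields `[2]`, and `[0xFF, 0xFF, 0]` yields no upper bound.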

…ed chunk

Round 1 review fixes for the manifest-bound truncation in MinMaxColAggregator:
- truncate_lower_bound / truncate_upper_bound for Fixed(N) now keep the column's declared PrimitiveType::Fixed(N) instead of re-typing as Fixed(<truncated_len>) via Datum::fixed. Use Datum::new(ty, Binary(bytes)) so downstream code that introspects datum.data_type() continues to see the schema's declared length.
- MinMaxColAggregator now tracks an upper_unbounded set. When truncate_upper_bound returns None for any row group's max, the column's partial upper bound is dropped and further updates are blocked. Without this, an earlier row group's small upper bound could be left in place while a later row group's true max strictly exceeds it, producing a manifest upper_bound < true_max and breaking scan-time pruning.
- Doc-comment on update() corrected: Java's ParquetUtil#updateMin/updateMax does not consult isMinExact/isMaxExact; we mirror that.
- Doc on truncate_string_max calls out the Java-equivalent UTF-16 surrogate jump from U+D7FF to U+E000 (relevant for apache#2486).
- Note on truncate-then-compare equivalence with Java's compare-then-truncate added inline.
- Tests: extracted single_primitive_field_schema helper; added test_min_max_aggregator_merges_truncated_strings_across_row_groups, test_min_max_aggregator_drops_upper_after_unbounded_row_group, test_truncate_lower_upper_bound_fixed_preserves_declared_type, test_min_max_aggregator_truncates_long_fixed_bounds.
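The truncate-then-compare equivalence mentioned above can be checked on the lower-bound side with a short sketch. Prefix truncation never reorders values (if a <= b, then the truncated a <= the truncated b), so taking the minimum before or after truncating gives the same result. All function names here are hypothetical, and only the lower-bound case is shown.

```rust
/// Keep at most `len` code points (the lower-bound truncation).
fn truncate_min(s: &str, len: usize) -> String {
    s.chars().take(len).collect()
}

/// Compare-then-truncate: find the true minimum, then truncate it.
fn compare_then_truncate(values: &[&str], len: usize) -> Option<String> {
    values.iter().min().map(|s| truncate_min(s, len))
}

/// Truncate-then-compare: truncate every value, then take the minimum.
fn truncate_then_compare(values: &[&str], len: usize) -> Option<String> {
    values.iter().map(|s| truncate_min(s, len)).min()
}
```

For inputs such as `["banana-split-with-cherries", "banana-bread", "apple-pie-with-custard"]` with a length of 5, both orders produce `apple`.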

…datum

After truncation, a Fixed(N) Datum carries fewer than N bytes in its literal even though the declared type says N. Add a regression test that exercises the two paths downstream consumers actually use:
- `Datum::to_bytes()`: manifest single-value serialization writes the literal bytes verbatim regardless of declared Fixed length.
- `PartialOrd`: wildcards on Fixed length and compares lexicographically on raw bytes, so two truncated Fixed datums (16 bytes typed Fixed(20)) order correctly relative to each other.

This locks the contract that future changes to Datum / PrimitiveLiteral serialization must preserve.

SreeramGarlapati added 3 commits May 30, 2026 20:27

SreeramGarlapati changed the title ~~feat(parquet): truncate manifest bounds for STRING/BINARY/FIXED to match Java~~ feat(parquet_writer): truncate manifest bounds for STRING/BINARY/FIXED to match Java May 31, 2026

chore(ci): retrigger CI (transient infra flake on check (macos-latest))

6702559

SreeramGarlapati wants to merge 4 commits into apache:main from SreeramGarlapati:schema-evolution-v2
