
feat(parquet_writer): truncate manifest bounds for STRING/BINARY/FIXED to match Java #2544

Open
SreeramGarlapati wants to merge 4 commits into apache:main from SreeramGarlapati:schema-evolution-v2

@SreeramGarlapati SreeramGarlapati commented May 31, 2026

Which issue does this PR close?

What changes are included in this PR?

When MinMaxColAggregator collects per-row-group Parquet statistics in crates/iceberg/src/writer/file_writer/parquet_writer.rs, the resulting lower_bounds / upper_bounds are now truncated to Iceberg's default length of 16 units, matching Java's org.apache.iceberg.util.UnicodeUtil (codepoint-based, for STRING) and BinaryUtil (byte-based, for BINARY and FIXED).

Without this, long string/binary values produced manifest bounds that exceeded the conventional 16-unit budget and disagreed with the bounds Java/Spark would have written for the same data. In a two-writer setup where Java/Spark performs DDL/compaction on tables that iceberg-rust appends to, that disagreement breaks the correctness of scan-time min/max pruning for downstream readers.

The change has two pieces:

  1. Iceberg-default truncation (16 units) applied at the manifest layer:
    • new private helpers truncate_string_min/max (codepoint, mirrors UnicodeUtil) and truncate_binary_min/max (bytes, mirrors BinaryUtil) in parquet_writer.rs.
    • truncate_lower_bound / truncate_upper_bound dispatchers over (PrimitiveType, Datum) for the only three types that need truncation (String, Binary, Fixed(N)); other primitives pass through unchanged.
    • For Fixed(N), the truncated Datum keeps the column's declared PrimitiveType::Fixed(N) (using Datum::new) so downstream code that introspects datum.data_type() keeps seeing the schema's length.
    • Upper-bound increment walks past UTF-16 surrogates (U+D800–U+DFFF) because Rust's char::from_u32 rejects them; Java's incrementCodePoint performs the same U+D7FF -> U+E000 jump, so the produced bound matches Java for any valid &str.
  2. Bundled fix to inexact-stats handling: MinMaxColAggregator::update no longer drops Parquet stats whose min_is_exact/max_is_exact is false. Java's ParquetUtil#updateMin/updateMax does not consult those flags; it always feeds the Parquet-reported value through BinaryUtil/UnicodeUtil truncation. We mirror that by using min_bytes_opt/max_bytes_opt to detect presence. A Parquet-prefix-truncated min is still <= every value and a Parquet-truncated max is still >= every value, so the secondary 16-unit Iceberg truncation is sound. Without this fix, long-string columns whose stats the Parquet writer had already truncated got no manifest bounds at all. The regression test is test_min_max_aggregator_keeps_inexact_string_stats.
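For readers who want a concrete picture of the codepoint-based STRING truncation described in piece 1, here is a minimal standalone sketch. It is not the code in this PR: the function names mirror the helpers named above, but the signatures, the `next_char` helper, and the explicit `len` parameter are assumptions made for illustration.

```rust
/// Lower bound: keep the first `len` code points.
/// A prefix always sorts <= the original string.
fn truncate_string_min(value: &str, len: usize) -> String {
    value.chars().take(len).collect()
}

/// Hypothetical helper: the next valid Unicode scalar value after `c`.
/// Jumps over the surrogate range (U+D7FF -> U+E000) and returns `None`
/// when `c` is already `char::MAX` (U+10FFFF).
fn next_char(c: char) -> Option<char> {
    if c == '\u{D7FF}' {
        Some('\u{E000}')
    } else {
        // `from_u32` returns None for values past U+10FFFF.
        char::from_u32(c as u32 + 1)
    }
}

/// Upper bound: keep the first `len` code points, then increment the last
/// one. If it cannot be incremented, drop it and try the previous position.
/// Returns `None` when every position in the prefix is already at the max.
fn truncate_string_max(value: &str, len: usize) -> Option<String> {
    let mut prefix: Vec<char> = value.chars().take(len).collect();
    // Nothing was cut off, so the value is its own tightest upper bound.
    if prefix.len() == value.chars().count() {
        return Some(value.to_string());
    }
    while let Some(last) = prefix.pop() {
        if let Some(next) = next_char(last) {
            prefix.push(next);
            return Some(prefix.into_iter().collect());
        }
    }
    None
}
```

With `len = 16`, `truncate_string_max("abcdefghijklmnopqrstuvwxyz", 16)` yields `Some("abcdefghijklmnoq")`, and a string made only of `char::MAX` yields `None`. Because Rust compares `&str` values by their UTF-8 bytes, which agrees with code point order, the incremented prefix sorts strictly above the original value.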

Multi-row-group correctness: when one row group's max is unboundable (truncate-and-increment returns None — e.g. all char::MAX / all 0xFF), the aggregator drops the column's upper bound entirely and prevents future updates from re-adding it. Without this, an earlier row group's small upper bound could be left in place while a later row group's true max strictly exceeds it, producing a manifest upper_bound < true_max.
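The multi-row-group rule can be sketched in isolation as follows. This is an illustrative model, not the aggregator from this PR: the `UpperBounds` type, its fields, and the use of raw byte vectors keyed by field id are all assumptions; only the "drop the bound and block further updates" behavior comes from the description above.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the upper-bound half of the aggregator.
#[derive(Default)]
struct UpperBounds {
    /// Current upper bound per field id.
    bounds: HashMap<i32, Vec<u8>>,
    /// Fields for which some row group produced no sound upper bound.
    unbounded: HashSet<i32>,
}

impl UpperBounds {
    /// `truncated_max` is `None` when truncate-and-increment could not
    /// produce a bound for this row group (e.g. the prefix is all 0xFF).
    fn update(&mut self, field_id: i32, truncated_max: Option<Vec<u8>>) {
        // Once a field is unbounded, later row groups must not re-add a bound.
        if self.unbounded.contains(&field_id) {
            return;
        }
        match truncated_max {
            None => {
                // Drop any bound recorded by earlier row groups.
                self.bounds.remove(&field_id);
                self.unbounded.insert(field_id);
            }
            Some(candidate) => {
                // Keep the larger of the existing bound and the candidate.
                let replace = self
                    .bounds
                    .get(&field_id)
                    .map_or(true, |current| candidate > *current);
                if replace {
                    self.bounds.insert(field_id, candidate);
                }
            }
        }
    }

    fn upper_bound(&self, field_id: i32) -> Option<&Vec<u8>> {
        self.bounds.get(&field_id)
    }
}
```

Feeding `Some(b"abd")`, then `None`, then `Some(b"zzz")` for the same field leaves `upper_bound` as `None`: the unboundable row group removes the earlier bound, and the later one cannot reinstate it.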

API surface: none. All new items are private to the file. No public API changes.

Out of scope (intentionally deferred):

Are these changes tested?

Yes. cargo test -p iceberg --lib → 1318 passed / 0 failed. cargo clippy -p iceberg --lib --tests clean. cargo fmt applied.

22 new tests under crates/iceberg/src/writer/file_writer/parquet_writer.rs::tests:

Truncation helpers (13):

  • test_truncate_string_min_short_input_unchanged
  • test_truncate_string_min_long_input_truncates_codepoints
  • test_truncate_string_max_short_input_unchanged
  • test_truncate_string_max_long_input_increments_last_codepoint
  • test_truncate_string_max_overflow_drops_position
  • test_truncate_string_max_skips_utf16_surrogates
  • test_truncate_string_max_all_max_returns_none
  • test_truncate_binary_min_short_input_unchanged
  • test_truncate_binary_min_long_input_truncates
  • test_truncate_binary_max_short_input_unchanged
  • test_truncate_binary_max_long_input_increments_last_byte
  • test_truncate_binary_max_drops_trailing_0xff
  • test_truncate_binary_max_all_ff_returns_none

Aggregator (8):

  • test_min_max_aggregator_keeps_inexact_string_stats (regression for the bundled fix)
  • test_min_max_aggregator_truncates_long_string_bounds
  • test_min_max_aggregator_truncates_long_binary_bounds
  • test_min_max_aggregator_truncates_long_fixed_bounds
  • test_min_max_aggregator_drops_only_upper_when_unboundable
  • test_min_max_aggregator_merges_truncated_strings_across_row_groups
  • test_min_max_aggregator_drops_upper_after_unbounded_row_group
  • test_truncate_lower_upper_bound_fixed_preserves_declared_type

End-to-end (1):

  • test_parquet_writer_truncates_long_string_bounds — writes long-string rows through ParquetWriter and asserts data_file.lower_bounds() / upper_bounds() match the Java-equivalent 16-codepoint truncation.

…tch Java

Apply Iceberg's default 16-unit bound truncation to manifest lower/upper
bounds for STRING (codepoint-based), BINARY, and FIXED (byte-based) when
collecting per-row-group statistics in `MinMaxColAggregator`.

This mirrors Java's `org.apache.iceberg.util.UnicodeUtil#truncateStringMin/Max`
and `BinaryUtil#truncateBinaryMin/Max`, called from `ParquetUtil#updateMin/Max`.
Without this, long values produced manifest bounds that exceeded the conventional
16-unit budget and didn't agree with bounds Spark/Java would have written for
the same data.

Upper-bound truncation: take the 16-unit prefix, then increment the last unit;
on overflow drop that position and try the previous one. If every position in
the prefix is at max, we cannot produce a sound upper bound and drop it (matches
Java semantics; lower bound is still recorded).
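The byte-based variant of this procedure (for BINARY and FIXED) can be sketched as below. This is a standalone illustration under stated assumptions, not the code from this commit: the names follow the helpers mentioned in the PR description, while the signatures and the explicit `len` parameter are hypothetical.

```rust
/// Lower bound: keep at most the first `len` bytes.
fn truncate_binary_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Upper bound: take the `len`-byte prefix, then increment the last byte.
/// A trailing 0xFF cannot be incremented, so that position is dropped and
/// the previous byte is tried. Returns `None` if every byte is 0xFF.
fn truncate_binary_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    // Nothing to cut off: the value is its own upper bound.
    if value.len() <= len {
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    while let Some(last) = prefix.pop() {
        if last != u8::MAX {
            prefix.push(last + 1);
            return Some(prefix);
        }
    }
    None
}
```

For example, truncating `[0x01, 0x02, 0xFF, 0xFF, 0x00]` to 4 bytes drops both trailing `0xFF` bytes and increments `0x02`, giving `[0x01, 0x03]`, which still sorts above the original under lexicographic byte comparison.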

For STRING upper bounds we walk past UTF-16 surrogates (U+D800-U+DFFF) when
incrementing because Rust's `char::from_u32` rejects them; Java's
`Character.isValidCodePoint` accepts surrogates, but skipping them in Rust
preserves monotonic ordering for valid UTF-8.

Tests added (18):
- 13 helper unit tests covering short input, long input, overflow drop, all-max
  fallback, and the UTF-16 surrogate skip
- 4 aggregator tests for STRING/BINARY truncation behavior and the
  drop-only-upper case
- 1 end-to-end tokio test that writes long-string rows through ParquetWriter
  and asserts the resulting `data_file.lower_bounds()` / `upper_bounds()`
…ed chunk

Round 1 review fixes for the manifest-bound truncation in MinMaxColAggregator:

- truncate_lower_bound / truncate_upper_bound for Fixed(N) now keep the
  column's declared PrimitiveType::Fixed(N) instead of re-typing as
  Fixed(<truncated_len>) via Datum::fixed. Use Datum::new(ty, Binary(bytes))
  so downstream code that introspects datum.data_type() continues to see
  the schema's declared length.
- MinMaxColAggregator now tracks an upper_unbounded set. When
  truncate_upper_bound returns None for any row group's max, the column's
  partial upper bound is dropped and further updates are blocked. Without
  this, an earlier row group's small upper bound could be left in place
  while a later row group's true max strictly exceeds it, producing a
  manifest upper_bound < true_max and breaking scan-time pruning.
- Doc-comment on update() corrected: Java's ParquetUtil#updateMin/updateMax
  does not consult isMinExact/isMaxExact; we mirror that.
- Doc on truncate_string_max calls out the Java-equivalent UTF-16 surrogate
  jump from U+D7FF to U+E000 (relevant for apache#2486).
- Note on truncate-then-compare equivalence with Java's compare-then-truncate
  added inline.
- Tests: extracted single_primitive_field_schema helper; added
  test_min_max_aggregator_merges_truncated_strings_across_row_groups,
  test_min_max_aggregator_drops_upper_after_unbounded_row_group,
  test_truncate_lower_upper_bound_fixed_preserves_declared_type,
  test_min_max_aggregator_truncates_long_fixed_bounds.
…datum

After truncation, a Fixed(N) Datum carries fewer than N bytes in its
literal even though the declared type says N. Add a regression test that
exercises the two paths downstream consumers actually use:

- `Datum::to_bytes()`: manifest single-value serialization writes the
  literal bytes verbatim regardless of declared Fixed length.
- `PartialOrd`: treats the Fixed length as a wildcard and compares the raw
  bytes lexicographically, so two truncated Fixed datums (16 bytes typed
  Fixed(20)) order correctly relative to each other.

This locks the contract that future changes to Datum / PrimitiveLiteral
serialization must preserve.
SreeramGarlapati changed the title from "feat(parquet): truncate manifest bounds for STRING/BINARY/FIXED to match Java" to "feat(parquet_writer): truncate manifest bounds for STRING/BINARY/FIXED to match Java" on May 31, 2026