test(parquet): trim verbose comments in arrow_writer_layout tests by alamb · Pull Request #17 · pydantic/arrow-rs

alamb · 2026-06-03T20:51:38Z

Targets apache#9972 (pydantic:parquet-page-size-mid-batch).

Removes the two multi-line explanatory comment blocks from test_fixed_size_binary and test_dictionary in parquet/tests/arrow_writer_layout.rs, keeping the one-line summaries. The detailed rationale belongs in the PR discussion rather than inline in the test bodies.

Net: -20 lines, comments only. All arrow_writer_layout tests still pass.

Add the regression tests first, before any fix (TDD). With unmodified `main` the page-bounding assertions fail: the column writer only checks the data/dictionary page byte limit *after* each `write_batch_size` mini-batch, so large variable-width values pile into a single oversized page (we've observed 2 GiB data pages and ~256x dictionary-page overshoot at default settings). Column-writer tests (`ColumnValueEncoderImpl` path): - large BYTE_ARRAY values cap data pages near one value each - large values inside a repeated/list column (record-boundary stepping) - nullable column (value vs level counting) - dictionary spill then plain-encode large values - large distinct values bound the dictionary page - FIXED_LEN_BYTE_ARRAY byte budget Arrow-writer tests (`ByteArrayEncoder` path, what real users hit): - large `Utf8` strings via `ArrowWriter` - mixed small/large strings round-trip bit-identically - large `Utf8View` strings - all-null string column stays correct The subsequent commits make each of these pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Make the mini-batch size byte-budget aware so the post-write page check fires before a data page grows unbounded: - New `ColumnValueEncoder::count_values_within_byte_budget{,_gather}` trait methods (default `None` = "no estimate; stay batched"), with a concrete impl on `ColumnValueEncoderImpl` driven by `plain_encoded_byte_size`. Fixed-width physical types answer in one division; only variable-width BYTE_ARRAY/FLBA walk values, exiting at the first that overruns the budget. - New `LevelDataRef::value_count` converts a chunk's level span into a leaf-value count (O(1) for flat columns, def-level scan for nullable/nested), with a unit test. - New `ByteBudgetChunker` picks the largest sub-batch that fits one page budget. For the common case (small or fixed-width values) it returns the full chunk with no value inspection, so the hot path is unchanged. - `write_batch_internal` consults the chunker per chunk and, only when a chunk would overflow, routes through the new `write_granular_chunk`, which sub-batches so the post-write check fires in time. Repeated/nested columns step on record (rep == 0) boundaries so a record never spans pages. This makes the column-writer data-page, list, nullable and FLBA regression tests pass. Dictionary-encoding columns are still left on the batched path (the data page holds only small RLE indices) — bounding the dictionary page is a separate commit, so the two dictionary tests and the arrow `ByteArrayEncoder` tests do not pass yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implement `ColumnValueEncoder::count_values_within_byte_budget_gather` for `ByteArrayEncoder`, the encoder real `ArrowWriter` users hit. This makes the page-size bound fire for arrow string/binary columns; the previous commit only wired the generic column-writer path. Makes the arrow-writer regression tests pass. The implementation stays off the hot path for small values via cheap O(1) upper bounds before any per-value walk: - Offset-backed arrays (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`): the span `offsets[last+1] - offsets[first]` bounds the chunk's payload in O(1); if it fits, every value fits. The span is exact even for nullable columns (skipped positions are nulls with zero offset delta), so sparse `indices` skip the per-value walk too. - View arrays (`Utf8View`/`BinaryView`): lengths live in the low 32 bits of each view word, so an O(1) `n * (max_value_len + 4)` bound skips the scan in the common case; otherwise scan lengths with no data-buffer deref. - Dictionary input: treated as always-fitting — dict-encoded arrow input implies values small enough to dedup, the opposite of the blob case this targets, and a per-key walk measurably regressed the bench. - FixedSizeBinary: falls through to the generic accessor walk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

While a column dictionary-encodes, the data page holds only small RLE indices but the *dictionary* page accumulates the distinct values themselves, and its spill check runs only once per mini-batch. A mini-batch of large distinct values therefore interns `write_batch_size * value_size` bytes into the dictionary page before the check fires — ~256x the limit in the worst case. Extend `ByteBudgetChunker` to bound the dictionary-encoding phase too: when `has_dictionary()`, size the mini-batch against the dictionary page's *remaining* budget (`dict_page_byte_limit - estimated_dict_page_size`) rather than the data page. Fixed-width columns short-circuit via a precomputed `static_dict_always_fits`, so only large variable-width distinct values pay the per-value walk. Makes the two dictionary regression tests pass. `arrow_writer_layout`'s `test_string` updates accordingly: the dictionary page is now bounded at its 1000-byte limit and spills one mini-batch earlier (125 rows rather than 130) instead of overshooting to 1040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The variable-width byte-budget walks returned the largest count whose cumulative encoded size was *under* the budget, so each mini-batch ended just short of the page threshold. When the input row batch did not divide evenly into mini-batches, the remainder rolled into the next page and produced a bimodal page-size pattern (e.g. 128B values, 64KB budget, 1024-row batches: 968 / 540 / 540 ... values per page). Return the boundary value's index + 1 instead, so the mini-batch crosses the threshold by exactly one value and the caller's page-flush check trips immediately, with no leftover sliver carried into the next page. The worst-case overshoot per page is one value's encoded size, which already matched the previous behavior whenever a single value alone exceeded the budget (the dropped .max(1) floor). Reported by Ed Seidel in apache#9972 review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The mini-batch byte-budget walk now includes the value that crosses the budget, so the dictionary in the spill sub-test fills at 126 rows (1008 bytes) instead of 125 rows (1000 bytes), and the downstream plain page picks up 1254 rows / 10032 bytes instead of 1255 / 10040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dSizeBinary page sizing Adds two `LayoutTest` cases to `arrow_writer_layout.rs` that exercise byte-budget page-sizing paths introduced in apache#9972 through the real `ArrowWriter` user path: - `test_dictionary`: an arrow `DictionaryArray<Int32, Utf8>` input, which drives the dictionary-input arm of `ByteArrayEncoder::count_values_within_byte_budget_gather` (`DataType::Dictionary(_, _) => indices.len()`). Previously uncovered. - `test_fixed_size_binary`: a non-dictionary `FixedSizeBinary` column, which the arrow writer routes through the generic `ColumnValueEncoderImpl<FixedLenByteArrayType>`. Covers the FLBA branch of `plain_encoded_byte_size` and the variable-width scan in `count_values_within_byte_budget` via the arrow path (only the raw column-writer test covered it before). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the regression tests first, before any fix (TDD). With unmodified `main` the page-bounding assertions fail: the column writer only checks the data/dictionary page byte limit *after* each `write_batch_size` mini-batch, so large variable-width values pile into a single oversized page (we've observed 2 GiB data pages and ~256x dictionary-page overshoot at default settings). Column-writer tests (`ColumnValueEncoderImpl` path): - large BYTE_ARRAY values cap data pages near one value each - large values inside a repeated/list column (record-boundary stepping) - nullable column (value vs level counting) - dictionary spill then plain-encode large values - large distinct values bound the dictionary page - FIXED_LEN_BYTE_ARRAY byte budget Arrow-writer tests (`ByteArrayEncoder` path, what real users hit): - large `Utf8` strings via `ArrowWriter` - mixed small/large strings round-trip bit-identically - large `Utf8View` strings - all-null string column stays correct The subsequent commits make each of these pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Make the mini-batch size byte-budget aware so the post-write page check fires before a data page grows unbounded: - New `ColumnValueEncoder::count_values_within_byte_budget{,_gather}` trait methods (default `None` = "no estimate; stay batched"), with a concrete impl on `ColumnValueEncoderImpl` driven by `plain_encoded_byte_size`. Fixed-width physical types answer in one division; only variable-width BYTE_ARRAY/FLBA walk values, exiting at the first that overruns the budget. - New `LevelDataRef::value_count` converts a chunk's level span into a leaf-value count (O(1) for flat columns, def-level scan for nullable/nested), with a unit test. - New `ByteBudgetChunker` picks the largest sub-batch that fits one page budget. For the common case (small or fixed-width values) it returns the full chunk with no value inspection, so the hot path is unchanged. - `write_batch_internal` consults the chunker per chunk and, only when a chunk would overflow, routes through the new `write_granular_chunk`, which sub-batches so the post-write check fires in time. Repeated/nested columns step on record (rep == 0) boundaries so a record never spans pages. This makes the column-writer data-page, list, nullable and FLBA regression tests pass. Dictionary-encoding columns are still left on the batched path (the data page holds only small RLE indices) — bounding the dictionary page is a separate commit, so the two dictionary tests and the arrow `ByteArrayEncoder` tests do not pass yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implement `ColumnValueEncoder::count_values_within_byte_budget_gather` for `ByteArrayEncoder`, the encoder real `ArrowWriter` users hit. This makes the page-size bound fire for arrow string/binary columns; the previous commit only wired the generic column-writer path. Makes the arrow-writer regression tests pass. The implementation stays off the hot path for small values via cheap O(1) upper bounds before any per-value walk: - Offset-backed arrays (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`): the span `offsets[last+1] - offsets[first]` bounds the chunk's payload in O(1); if it fits, every value fits. The span is exact even for nullable columns (skipped positions are nulls with zero offset delta), so sparse `indices` skip the per-value walk too. - View arrays (`Utf8View`/`BinaryView`): lengths live in the low 32 bits of each view word, so an O(1) `n * (max_value_len + 4)` bound skips the scan in the common case; otherwise scan lengths with no data-buffer deref. - Dictionary input: treated as always-fitting — dict-encoded arrow input implies values small enough to dedup, the opposite of the blob case this targets, and a per-key walk measurably regressed the bench. - FixedSizeBinary: falls through to the generic accessor walk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

While a column dictionary-encodes, the data page holds only small RLE indices but the *dictionary* page accumulates the distinct values themselves, and its spill check runs only once per mini-batch. A mini-batch of large distinct values therefore interns `write_batch_size * value_size` bytes into the dictionary page before the check fires — ~256x the limit in the worst case. Extend `ByteBudgetChunker` to bound the dictionary-encoding phase too: when `has_dictionary()`, size the mini-batch against the dictionary page's *remaining* budget (`dict_page_byte_limit - estimated_dict_page_size`) rather than the data page. Fixed-width columns short-circuit via a precomputed `static_dict_always_fits`, so only large variable-width distinct values pay the per-value walk. Makes the two dictionary regression tests pass. `arrow_writer_layout`'s `test_string` updates accordingly: the dictionary page is now bounded at its 1000-byte limit and spills one mini-batch earlier (125 rows rather than 130) instead of overshooting to 1040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The variable-width byte-budget walks returned the largest count whose cumulative encoded size was *under* the budget, so each mini-batch ended just short of the page threshold. When the input row batch did not divide evenly into mini-batches, the remainder rolled into the next page and produced a bimodal page-size pattern (e.g. 128B values, 64KB budget, 1024-row batches: 968 / 540 / 540 ... values per page). Return the boundary value's index + 1 instead, so the mini-batch crosses the threshold by exactly one value and the caller's page-flush check trips immediately, with no leftover sliver carried into the next page. The worst-case overshoot per page is one value's encoded size, which already matched the previous behavior whenever a single value alone exceeded the budget (the dropped .max(1) floor). Reported by Ed Seidel in apache#9972 review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The mini-batch byte-budget walk now includes the value that crosses the budget, so the dictionary in the spill sub-test fills at 126 rows (1008 bytes) instead of 125 rows (1000 bytes), and the downstream plain page picks up 1254 rows / 10032 bytes instead of 1255 / 10040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(parquet): LayoutTest coverage for dictionary + FixedSizeBinary page sizing

Fixes the CI failures from #16 (cargo fmt + clippy unnecessary_cast in arrow_writer_layout.rs) and applies that PR's follow-up recommendation: remove the unreachable count_within_budget_accessor free function and its `_ =>` accessor dispatch arm in ByteArrayEncoder. The arm was dead code: every byte-array type ByteArrayEncoder is built for has an explicit match arm, and a Dictionary(value = FixedSizeBinary) column hits the Dictionary(_, _) arm (its values.data_type() is Dictionary). A bare FixedSizeBinary column is routed to the generic column writer, never this encoder. The arm is now an unreachable! with that rationale, and the inaccurate "FixedSizeBinary falls through to the per-value walk" comment is dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove the multi-line explanatory comment blocks from `test_fixed_size_binary` and `test_dictionary`, keeping the one-line summaries. The detail belongs in the PR discussion, not inline in the test bodies. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

alamb · 2026-06-03T21:43:14Z

Here is a follow on @adriangb if you have time

adriangb and others added 16 commits June 2, 2026 07:04

Merge pull request #16 from alamb/alamb/parquet-page-size-layout-tests

7d305a6

test(parquet): LayoutTest coverage for dictionary + FixedSizeBinary page sizing

github-actions Bot added the parquet label Jun 3, 2026

adriangb force-pushed the parquet-page-size-mid-batch branch 2 times, most recently from cacbd4b to c654237 Compare June 3, 2026 22:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(parquet): trim verbose comments in arrow_writer_layout tests#17

test(parquet): trim verbose comments in arrow_writer_layout tests#17
alamb wants to merge 16 commits into
pydantic:parquet-page-size-mid-batchfrom
alamb:alamb/remove-verbose-layout-test-comments

alamb commented Jun 3, 2026

Uh oh!

alamb commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alamb commented Jun 3, 2026

Uh oh!

alamb commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants