Skip to content

test(parquet): trim verbose comments in arrow_writer_layout tests#17

Open
alamb wants to merge 16 commits into
pydantic:parquet-page-size-mid-batchfrom
alamb:alamb/remove-verbose-layout-test-comments
Open

test(parquet): trim verbose comments in arrow_writer_layout tests#17
alamb wants to merge 16 commits into
pydantic:parquet-page-size-mid-batchfrom
alamb:alamb/remove-verbose-layout-test-comments

Conversation

@alamb
Copy link
Copy Markdown

@alamb alamb commented Jun 3, 2026

Targets apache#9972 (pydantic:parquet-page-size-mid-batch).

Removes the two multi-line explanatory comment blocks from test_fixed_size_binary and test_dictionary in parquet/tests/arrow_writer_layout.rs, keeping the one-line summaries. The detailed rationale belongs in the PR discussion rather than inline in the test bodies.

Net: -20 lines, comments only. All arrow_writer_layout tests still pass.

adriangb and others added 16 commits June 2, 2026 07:04
Add the regression tests first, before any fix (TDD). With unmodified
`main` the page-bounding assertions fail: the column writer only checks
the data/dictionary page byte limit *after* each `write_batch_size`
mini-batch, so large variable-width values pile into a single oversized
page (we've observed 2 GiB data pages and ~256x dictionary-page
overshoot at default settings).

Column-writer tests (`ColumnValueEncoderImpl` path):
- large BYTE_ARRAY values cap data pages near one value each
- large values inside a repeated/list column (record-boundary stepping)
- nullable column (value vs level counting)
- dictionary spill then plain-encode large values
- large distinct values bound the dictionary page
- FIXED_LEN_BYTE_ARRAY byte budget

Arrow-writer tests (`ByteArrayEncoder` path, what real users hit):
- large `Utf8` strings via `ArrowWriter`
- mixed small/large strings round-trip bit-identically
- large `Utf8View` strings
- all-null string column stays correct

The subsequent commits make each of these pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the mini-batch size byte-budget aware so the post-write page check
fires before a data page grows unbounded:

- New `ColumnValueEncoder::count_values_within_byte_budget{,_gather}`
  trait methods (default `None` = "no estimate; stay batched"), with a
  concrete impl on `ColumnValueEncoderImpl` driven by
  `plain_encoded_byte_size`. Fixed-width physical types answer in one
  division; only variable-width BYTE_ARRAY/FLBA walk values, exiting at
  the first that overruns the budget.
- New `LevelDataRef::value_count` converts a chunk's level span into a
  leaf-value count (O(1) for flat columns, def-level scan for
  nullable/nested), with a unit test.
- New `ByteBudgetChunker` picks the largest sub-batch that fits one page
  budget. For the common case (small or fixed-width values) it returns
  the full chunk with no value inspection, so the hot path is unchanged.
- `write_batch_internal` consults the chunker per chunk and, only when a
  chunk would overflow, routes through the new `write_granular_chunk`,
  which sub-batches so the post-write check fires in time.
  Repeated/nested columns step on record (rep == 0) boundaries so a
  record never spans pages.

This makes the column-writer data-page, list, nullable and FLBA
regression tests pass. Dictionary-encoding columns are still left on the
batched path (the data page holds only small RLE indices) — bounding the
dictionary page is a separate commit, so the two dictionary tests and
the arrow `ByteArrayEncoder` tests do not pass yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement `ColumnValueEncoder::count_values_within_byte_budget_gather`
for `ByteArrayEncoder`, the encoder real `ArrowWriter` users hit. This
makes the page-size bound fire for arrow string/binary columns; the
previous commit only wired the generic column-writer path. Makes the
arrow-writer regression tests pass.

The implementation stays off the hot path for small values via cheap
O(1) upper bounds before any per-value walk:

- Offset-backed arrays (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`): the
  span `offsets[last+1] - offsets[first]` bounds the chunk's payload in
  O(1); if it fits, every value fits. The span is exact even for
  nullable columns (skipped positions are nulls with zero offset delta),
  so sparse `indices` skip the per-value walk too.
- View arrays (`Utf8View`/`BinaryView`): lengths live in the low 32 bits
  of each view word, so an O(1) `n * (max_value_len + 4)` bound skips the
  scan in the common case; otherwise scan lengths with no data-buffer
  deref.
- Dictionary input: treated as always-fitting — dict-encoded arrow input
  implies values small enough to dedup, the opposite of the blob case
  this targets, and a per-key walk measurably regressed the bench.
- FixedSizeBinary: falls through to the generic accessor walk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While a column dictionary-encodes, the data page holds only small RLE
indices but the *dictionary* page accumulates the distinct values
themselves, and its spill check runs only once per mini-batch. A
mini-batch of large distinct values therefore interns
`write_batch_size * value_size` bytes into the dictionary page before
the check fires — ~256x the limit in the worst case.

Extend `ByteBudgetChunker` to bound the dictionary-encoding phase too:
when `has_dictionary()`, size the mini-batch against the dictionary
page's *remaining* budget (`dict_page_byte_limit - estimated_dict_page_size`)
rather than the data page. Fixed-width columns short-circuit via a
precomputed `static_dict_always_fits`, so only large variable-width
distinct values pay the per-value walk. Makes the two dictionary
regression tests pass.

`arrow_writer_layout`'s `test_string` updates accordingly: the
dictionary page is now bounded at its 1000-byte limit and spills one
mini-batch earlier (125 rows rather than 130) instead of overshooting to
1040.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The variable-width byte-budget walks returned the largest count whose
cumulative encoded size was *under* the budget, so each mini-batch
ended just short of the page threshold. When the input row batch did
not divide evenly into mini-batches, the remainder rolled into the
next page and produced a bimodal page-size pattern (e.g. 128B values,
64KB budget, 1024-row batches: 968 / 540 / 540 ... values per page).

Return the boundary value's index + 1 instead, so the mini-batch
crosses the threshold by exactly one value and the caller's page-flush
check trips immediately, with no leftover sliver carried into the next
page. The worst-case overshoot per page is one value's encoded size,
which already matched the previous behavior whenever a single value
alone exceeded the budget (the dropped .max(1) floor).

Reported by Ed Seidel in apache#9972 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mini-batch byte-budget walk now includes the value that crosses
the budget, so the dictionary in the spill sub-test fills at 126 rows
(1008 bytes) instead of 125 rows (1000 bytes), and the downstream
plain page picks up 1254 rows / 10032 bytes instead of 1255 / 10040.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dSizeBinary page sizing

Adds two `LayoutTest` cases to `arrow_writer_layout.rs` that exercise
byte-budget page-sizing paths introduced in apache#9972 through the real
`ArrowWriter` user path:

- `test_dictionary`: an arrow `DictionaryArray<Int32, Utf8>` input, which
  drives the dictionary-input arm of
  `ByteArrayEncoder::count_values_within_byte_budget_gather`
  (`DataType::Dictionary(_, _) => indices.len()`). Previously uncovered.

- `test_fixed_size_binary`: a non-dictionary `FixedSizeBinary` column,
  which the arrow writer routes through the generic
  `ColumnValueEncoderImpl<FixedLenByteArrayType>`. Covers the FLBA branch
  of `plain_encoded_byte_size` and the variable-width scan in
  `count_values_within_byte_budget` via the arrow path (only the raw
  column-writer test covered it before).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the regression tests first, before any fix (TDD). With unmodified
`main` the page-bounding assertions fail: the column writer only checks
the data/dictionary page byte limit *after* each `write_batch_size`
mini-batch, so large variable-width values pile into a single oversized
page (we've observed 2 GiB data pages and ~256x dictionary-page
overshoot at default settings).

Column-writer tests (`ColumnValueEncoderImpl` path):
- large BYTE_ARRAY values cap data pages near one value each
- large values inside a repeated/list column (record-boundary stepping)
- nullable column (value vs level counting)
- dictionary spill then plain-encode large values
- large distinct values bound the dictionary page
- FIXED_LEN_BYTE_ARRAY byte budget

Arrow-writer tests (`ByteArrayEncoder` path, what real users hit):
- large `Utf8` strings via `ArrowWriter`
- mixed small/large strings round-trip bit-identically
- large `Utf8View` strings
- all-null string column stays correct

The subsequent commits make each of these pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the mini-batch size byte-budget aware so the post-write page check
fires before a data page grows unbounded:

- New `ColumnValueEncoder::count_values_within_byte_budget{,_gather}`
  trait methods (default `None` = "no estimate; stay batched"), with a
  concrete impl on `ColumnValueEncoderImpl` driven by
  `plain_encoded_byte_size`. Fixed-width physical types answer in one
  division; only variable-width BYTE_ARRAY/FLBA walk values, exiting at
  the first that overruns the budget.
- New `LevelDataRef::value_count` converts a chunk's level span into a
  leaf-value count (O(1) for flat columns, def-level scan for
  nullable/nested), with a unit test.
- New `ByteBudgetChunker` picks the largest sub-batch that fits one page
  budget. For the common case (small or fixed-width values) it returns
  the full chunk with no value inspection, so the hot path is unchanged.
- `write_batch_internal` consults the chunker per chunk and, only when a
  chunk would overflow, routes through the new `write_granular_chunk`,
  which sub-batches so the post-write check fires in time.
  Repeated/nested columns step on record (rep == 0) boundaries so a
  record never spans pages.

This makes the column-writer data-page, list, nullable and FLBA
regression tests pass. Dictionary-encoding columns are still left on the
batched path (the data page holds only small RLE indices) — bounding the
dictionary page is a separate commit, so the two dictionary tests and
the arrow `ByteArrayEncoder` tests do not pass yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement `ColumnValueEncoder::count_values_within_byte_budget_gather`
for `ByteArrayEncoder`, the encoder real `ArrowWriter` users hit. This
makes the page-size bound fire for arrow string/binary columns; the
previous commit only wired the generic column-writer path. Makes the
arrow-writer regression tests pass.

The implementation stays off the hot path for small values via cheap
O(1) upper bounds before any per-value walk:

- Offset-backed arrays (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`): the
  span `offsets[last+1] - offsets[first]` bounds the chunk's payload in
  O(1); if it fits, every value fits. The span is exact even for
  nullable columns (skipped positions are nulls with zero offset delta),
  so sparse `indices` skip the per-value walk too.
- View arrays (`Utf8View`/`BinaryView`): lengths live in the low 32 bits
  of each view word, so an O(1) `n * (max_value_len + 4)` bound skips the
  scan in the common case; otherwise scan lengths with no data-buffer
  deref.
- Dictionary input: treated as always-fitting — dict-encoded arrow input
  implies values small enough to dedup, the opposite of the blob case
  this targets, and a per-key walk measurably regressed the bench.
- FixedSizeBinary: falls through to the generic accessor walk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While a column dictionary-encodes, the data page holds only small RLE
indices but the *dictionary* page accumulates the distinct values
themselves, and its spill check runs only once per mini-batch. A
mini-batch of large distinct values therefore interns
`write_batch_size * value_size` bytes into the dictionary page before
the check fires — ~256x the limit in the worst case.

Extend `ByteBudgetChunker` to bound the dictionary-encoding phase too:
when `has_dictionary()`, size the mini-batch against the dictionary
page's *remaining* budget (`dict_page_byte_limit - estimated_dict_page_size`)
rather than the data page. Fixed-width columns short-circuit via a
precomputed `static_dict_always_fits`, so only large variable-width
distinct values pay the per-value walk. Makes the two dictionary
regression tests pass.

`arrow_writer_layout`'s `test_string` updates accordingly: the
dictionary page is now bounded at its 1000-byte limit and spills one
mini-batch earlier (125 rows rather than 130) instead of overshooting to
1040.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The variable-width byte-budget walks returned the largest count whose
cumulative encoded size was *under* the budget, so each mini-batch
ended just short of the page threshold. When the input row batch did
not divide evenly into mini-batches, the remainder rolled into the
next page and produced a bimodal page-size pattern (e.g. 128B values,
64KB budget, 1024-row batches: 968 / 540 / 540 ... values per page).

Return the boundary value's index + 1 instead, so the mini-batch
crosses the threshold by exactly one value and the caller's page-flush
check trips immediately, with no leftover sliver carried into the next
page. The worst-case overshoot per page is one value's encoded size,
which already matched the previous behavior whenever a single value
alone exceeded the budget (the dropped .max(1) floor).

Reported by Ed Seidel in apache#9972 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mini-batch byte-budget walk now includes the value that crosses
the budget, so the dictionary in the spill sub-test fills at 126 rows
(1008 bytes) instead of 125 rows (1000 bytes), and the downstream
plain page picks up 1254 rows / 10032 bytes instead of 1255 / 10040.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test(parquet): LayoutTest coverage for dictionary + FixedSizeBinary page sizing
Fixes the CI failures from #16 (cargo fmt + clippy unnecessary_cast in
arrow_writer_layout.rs) and applies that PR's follow-up recommendation:
remove the unreachable count_within_budget_accessor free function and its
`_ =>` accessor dispatch arm in ByteArrayEncoder.

The arm was dead code: every byte-array type ByteArrayEncoder is built for
has an explicit match arm, and a Dictionary(value = FixedSizeBinary) column
hits the Dictionary(_, _) arm (its values.data_type() is Dictionary). A bare
FixedSizeBinary column is routed to the generic column writer, never this
encoder. The arm is now an unreachable! with that rationale, and the
inaccurate "FixedSizeBinary falls through to the per-value walk" comment is
dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the multi-line explanatory comment blocks from `test_fixed_size_binary`
and `test_dictionary`, keeping the one-line summaries. The detail belongs in
the PR discussion, not inline in the test bodies.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alamb
Copy link
Copy Markdown
Author

alamb commented Jun 3, 2026

Here is a follow on @adriangb if you have time

@adriangb adriangb force-pushed the parquet-page-size-mid-batch branch 2 times, most recently from cacbd4b to c654237 Compare June 3, 2026 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants