test(parquet): trim verbose comments in arrow_writer_layout tests#17
Open
alamb wants to merge 16 commits into
Open
test(parquet): trim verbose comments in arrow_writer_layout tests#17alamb wants to merge 16 commits into
alamb wants to merge 16 commits into
Conversation
Add the regression tests first, before any fix (TDD). With unmodified `main` the page-bounding assertions fail: the column writer only checks the data/dictionary page byte limit *after* each `write_batch_size` mini-batch, so large variable-width values pile into a single oversized page (we've observed 2 GiB data pages and ~256x dictionary-page overshoot at default settings). Column-writer tests (`ColumnValueEncoderImpl` path): - large BYTE_ARRAY values cap data pages near one value each - large values inside a repeated/list column (record-boundary stepping) - nullable column (value vs level counting) - dictionary spill then plain-encode large values - large distinct values bound the dictionary page - FIXED_LEN_BYTE_ARRAY byte budget Arrow-writer tests (`ByteArrayEncoder` path, what real users hit): - large `Utf8` strings via `ArrowWriter` - mixed small/large strings round-trip bit-identically - large `Utf8View` strings - all-null string column stays correct The subsequent commits make each of these pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the mini-batch size byte-budget aware so the post-write page check
fires before a data page grows unbounded:
- New `ColumnValueEncoder::count_values_within_byte_budget{,_gather}`
trait methods (default `None` = "no estimate; stay batched"), with a
concrete impl on `ColumnValueEncoderImpl` driven by
`plain_encoded_byte_size`. Fixed-width physical types answer in one
division; only variable-width BYTE_ARRAY/FLBA walk values, exiting at
the first that overruns the budget.
- New `LevelDataRef::value_count` converts a chunk's level span into a
leaf-value count (O(1) for flat columns, def-level scan for
nullable/nested), with a unit test.
- New `ByteBudgetChunker` picks the largest sub-batch that fits one page
budget. For the common case (small or fixed-width values) it returns
the full chunk with no value inspection, so the hot path is unchanged.
- `write_batch_internal` consults the chunker per chunk and, only when a
chunk would overflow, routes through the new `write_granular_chunk`,
which sub-batches so the post-write check fires in time.
Repeated/nested columns step on record (rep == 0) boundaries so a
record never spans pages.
This makes the column-writer data-page, list, nullable and FLBA
regression tests pass. Dictionary-encoding columns are still left on the
batched path (the data page holds only small RLE indices) — bounding the
dictionary page is a separate commit, so the two dictionary tests and
the arrow `ByteArrayEncoder` tests do not pass yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement `ColumnValueEncoder::count_values_within_byte_budget_gather` for `ByteArrayEncoder`, the encoder real `ArrowWriter` users hit. This makes the page-size bound fire for arrow string/binary columns; the previous commit only wired the generic column-writer path. Makes the arrow-writer regression tests pass. The implementation stays off the hot path for small values via cheap O(1) upper bounds before any per-value walk: - Offset-backed arrays (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`): the span `offsets[last+1] - offsets[first]` bounds the chunk's payload in O(1); if it fits, every value fits. The span is exact even for nullable columns (skipped positions are nulls with zero offset delta), so sparse `indices` skip the per-value walk too. - View arrays (`Utf8View`/`BinaryView`): lengths live in the low 32 bits of each view word, so an O(1) `n * (max_value_len + 4)` bound skips the scan in the common case; otherwise scan lengths with no data-buffer deref. - Dictionary input: treated as always-fitting — dict-encoded arrow input implies values small enough to dedup, the opposite of the blob case this targets, and a per-key walk measurably regressed the bench. - FixedSizeBinary: falls through to the generic accessor walk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While a column dictionary-encodes, the data page holds only small RLE indices but the *dictionary* page accumulates the distinct values themselves, and its spill check runs only once per mini-batch. A mini-batch of large distinct values therefore interns `write_batch_size * value_size` bytes into the dictionary page before the check fires — ~256x the limit in the worst case. Extend `ByteBudgetChunker` to bound the dictionary-encoding phase too: when `has_dictionary()`, size the mini-batch against the dictionary page's *remaining* budget (`dict_page_byte_limit - estimated_dict_page_size`) rather than the data page. Fixed-width columns short-circuit via a precomputed `static_dict_always_fits`, so only large variable-width distinct values pay the per-value walk. Makes the two dictionary regression tests pass. `arrow_writer_layout`'s `test_string` updates accordingly: the dictionary page is now bounded at its 1000-byte limit and spills one mini-batch earlier (125 rows rather than 130) instead of overshooting to 1040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The variable-width byte-budget walks returned the largest count whose cumulative encoded size was *under* the budget, so each mini-batch ended just short of the page threshold. When the input row batch did not divide evenly into mini-batches, the remainder rolled into the next page and produced a bimodal page-size pattern (e.g. 128B values, 64KB budget, 1024-row batches: 968 / 540 / 540 ... values per page). Return the boundary value's index + 1 instead, so the mini-batch crosses the threshold by exactly one value and the caller's page-flush check trips immediately, with no leftover sliver carried into the next page. The worst-case overshoot per page is one value's encoded size, which already matched the previous behavior whenever a single value alone exceeded the budget (the dropped .max(1) floor). Reported by Ed Seidel in apache#9972 review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mini-batch byte-budget walk now includes the value that crosses the budget, so the dictionary in the spill sub-test fills at 126 rows (1008 bytes) instead of 125 rows (1000 bytes), and the downstream plain page picks up 1254 rows / 10032 bytes instead of 1255 / 10040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dSizeBinary page sizing Adds two `LayoutTest` cases to `arrow_writer_layout.rs` that exercise byte-budget page-sizing paths introduced in apache#9972 through the real `ArrowWriter` user path: - `test_dictionary`: an arrow `DictionaryArray<Int32, Utf8>` input, which drives the dictionary-input arm of `ByteArrayEncoder::count_values_within_byte_budget_gather` (`DataType::Dictionary(_, _) => indices.len()`). Previously uncovered. - `test_fixed_size_binary`: a non-dictionary `FixedSizeBinary` column, which the arrow writer routes through the generic `ColumnValueEncoderImpl<FixedLenByteArrayType>`. Covers the FLBA branch of `plain_encoded_byte_size` and the variable-width scan in `count_values_within_byte_budget` via the arrow path (only the raw column-writer test covered it before). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the regression tests first, before any fix (TDD). With unmodified `main` the page-bounding assertions fail: the column writer only checks the data/dictionary page byte limit *after* each `write_batch_size` mini-batch, so large variable-width values pile into a single oversized page (we've observed 2 GiB data pages and ~256x dictionary-page overshoot at default settings). Column-writer tests (`ColumnValueEncoderImpl` path): - large BYTE_ARRAY values cap data pages near one value each - large values inside a repeated/list column (record-boundary stepping) - nullable column (value vs level counting) - dictionary spill then plain-encode large values - large distinct values bound the dictionary page - FIXED_LEN_BYTE_ARRAY byte budget Arrow-writer tests (`ByteArrayEncoder` path, what real users hit): - large `Utf8` strings via `ArrowWriter` - mixed small/large strings round-trip bit-identically - large `Utf8View` strings - all-null string column stays correct The subsequent commits make each of these pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the mini-batch size byte-budget aware so the post-write page check
fires before a data page grows unbounded:
- New `ColumnValueEncoder::count_values_within_byte_budget{,_gather}`
trait methods (default `None` = "no estimate; stay batched"), with a
concrete impl on `ColumnValueEncoderImpl` driven by
`plain_encoded_byte_size`. Fixed-width physical types answer in one
division; only variable-width BYTE_ARRAY/FLBA walk values, exiting at
the first that overruns the budget.
- New `LevelDataRef::value_count` converts a chunk's level span into a
leaf-value count (O(1) for flat columns, def-level scan for
nullable/nested), with a unit test.
- New `ByteBudgetChunker` picks the largest sub-batch that fits one page
budget. For the common case (small or fixed-width values) it returns
the full chunk with no value inspection, so the hot path is unchanged.
- `write_batch_internal` consults the chunker per chunk and, only when a
chunk would overflow, routes through the new `write_granular_chunk`,
which sub-batches so the post-write check fires in time.
Repeated/nested columns step on record (rep == 0) boundaries so a
record never spans pages.
This makes the column-writer data-page, list, nullable and FLBA
regression tests pass. Dictionary-encoding columns are still left on the
batched path (the data page holds only small RLE indices) — bounding the
dictionary page is a separate commit, so the two dictionary tests and
the arrow `ByteArrayEncoder` tests do not pass yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement `ColumnValueEncoder::count_values_within_byte_budget_gather` for `ByteArrayEncoder`, the encoder real `ArrowWriter` users hit. This makes the page-size bound fire for arrow string/binary columns; the previous commit only wired the generic column-writer path. Makes the arrow-writer regression tests pass. The implementation stays off the hot path for small values via cheap O(1) upper bounds before any per-value walk: - Offset-backed arrays (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`): the span `offsets[last+1] - offsets[first]` bounds the chunk's payload in O(1); if it fits, every value fits. The span is exact even for nullable columns (skipped positions are nulls with zero offset delta), so sparse `indices` skip the per-value walk too. - View arrays (`Utf8View`/`BinaryView`): lengths live in the low 32 bits of each view word, so an O(1) `n * (max_value_len + 4)` bound skips the scan in the common case; otherwise scan lengths with no data-buffer deref. - Dictionary input: treated as always-fitting — dict-encoded arrow input implies values small enough to dedup, the opposite of the blob case this targets, and a per-key walk measurably regressed the bench. - FixedSizeBinary: falls through to the generic accessor walk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While a column dictionary-encodes, the data page holds only small RLE indices but the *dictionary* page accumulates the distinct values themselves, and its spill check runs only once per mini-batch. A mini-batch of large distinct values therefore interns `write_batch_size * value_size` bytes into the dictionary page before the check fires — ~256x the limit in the worst case. Extend `ByteBudgetChunker` to bound the dictionary-encoding phase too: when `has_dictionary()`, size the mini-batch against the dictionary page's *remaining* budget (`dict_page_byte_limit - estimated_dict_page_size`) rather than the data page. Fixed-width columns short-circuit via a precomputed `static_dict_always_fits`, so only large variable-width distinct values pay the per-value walk. Makes the two dictionary regression tests pass. `arrow_writer_layout`'s `test_string` updates accordingly: the dictionary page is now bounded at its 1000-byte limit and spills one mini-batch earlier (125 rows rather than 130) instead of overshooting to 1040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The variable-width byte-budget walks returned the largest count whose cumulative encoded size was *under* the budget, so each mini-batch ended just short of the page threshold. When the input row batch did not divide evenly into mini-batches, the remainder rolled into the next page and produced a bimodal page-size pattern (e.g. 128B values, 64KB budget, 1024-row batches: 968 / 540 / 540 ... values per page). Return the boundary value's index + 1 instead, so the mini-batch crosses the threshold by exactly one value and the caller's page-flush check trips immediately, with no leftover sliver carried into the next page. The worst-case overshoot per page is one value's encoded size, which already matched the previous behavior whenever a single value alone exceeded the budget (the dropped .max(1) floor). Reported by Ed Seidel in apache#9972 review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mini-batch byte-budget walk now includes the value that crosses the budget, so the dictionary in the spill sub-test fills at 126 rows (1008 bytes) instead of 125 rows (1000 bytes), and the downstream plain page picks up 1254 rows / 10032 bytes instead of 1255 / 10040. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test(parquet): LayoutTest coverage for dictionary + FixedSizeBinary page sizing
Fixes the CI failures from #16 (cargo fmt + clippy unnecessary_cast in arrow_writer_layout.rs) and applies that PR's follow-up recommendation: remove the unreachable count_within_budget_accessor free function and its `_ =>` accessor dispatch arm in ByteArrayEncoder. The arm was dead code: every byte-array type ByteArrayEncoder is built for has an explicit match arm, and a Dictionary(value = FixedSizeBinary) column hits the Dictionary(_, _) arm (its values.data_type() is Dictionary). A bare FixedSizeBinary column is routed to the generic column writer, never this encoder. The arm is now an unreachable! with that rationale, and the inaccurate "FixedSizeBinary falls through to the per-value walk" comment is dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the multi-line explanatory comment blocks from `test_fixed_size_binary` and `test_dictionary`, keeping the one-line summaries. The detail belongs in the PR discussion, not inline in the test bodies. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
|
Here is a follow on @adriangb if you have time |
cacbd4b to
c654237
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Targets apache#9972 (
pydantic:parquet-page-size-mid-batch).Removes the two multi-line explanatory comment blocks from
test_fixed_size_binaryandtest_dictionaryinparquet/tests/arrow_writer_layout.rs, keeping the one-line summaries. The detailed rationale belongs in the PR discussion rather than inline in the test bodies.Net:
-20lines, comments only. Allarrow_writer_layouttests still pass.