fix(parquet): bound data page byte size for large variable-width values #9972
adriangb wants to merge 3 commits into
run benchmark arrow_writer
Force-pushed from 393ead0 to 4823429
🤖 Arrow criterion benchmark running (GKE): comparing parquet-page-size-mid-batch (4823429) to 48fa8a7 (merge-base)
🤖 Arrow criterion benchmark completed (GKE)
Force-pushed from 0fd6dcb to 24b83c7
Have you considered making the batch size configurable per column?
Yes, that may be a simpler approach. But I'm hoping we can get to a place where users don't have to think about or configure this. Given that they gave us a page size limit, it'd be nice if we could always adhere to it.

Another thought: maybe add another chunker like the one the CDC work added. If we compute batches up front, when we know the shape of the data, that might be faster 🤷
I also filed a ticket to track this |
The variable-width byte-budget walks returned the largest count whose cumulative encoded size was *under* the budget, so each mini-batch ended just short of the page threshold. When the input row batch did not divide evenly into mini-batches, the remainder rolled into the next page and produced a bimodal page-size pattern (e.g. 128B values, 64KB budget, 1024-row batches: 968 / 540 / 540 ... values per page).

Return the boundary value's index + 1 instead, so the mini-batch crosses the threshold by exactly one value and the caller's page-flush check trips immediately, with no leftover sliver carried into the next page. The worst-case overshoot per page is one value's encoded size, which already matched the previous behavior whenever a single value alone exceeded the budget (the dropped `.max(1)` floor).

Reported by Ed Seidel in apache#9972 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
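The difference between the two counting rules in the commit message above can be sketched as follows. This is a minimal illustration, not the PR's code: both function names are hypothetical, sizes are passed in as a precomputed slice, and treating "crossing" as `total >= budget` is an assumption.

```rust
/// Hypothetical model of the old rule: the largest count whose cumulative
/// encoded size stays under the budget, floored at 1 so a single oversized
/// value still makes progress.
fn count_under_budget(encoded_sizes: &[usize], budget: usize) -> usize {
    let mut total = 0usize;
    for (i, &size) in encoded_sizes.iter().enumerate() {
        total = total.saturating_add(size);
        if total >= budget {
            // Values 0..i fit; value i is the one that crosses the budget.
            return i.max(1);
        }
    }
    encoded_sizes.len()
}

/// Hypothetical model of the new rule: include the boundary value
/// (index + 1), so the batch crosses the budget by exactly one value.
fn count_crossing_budget(encoded_sizes: &[usize], budget: usize) -> usize {
    let mut total = 0usize;
    for (i, &size) in encoded_sizes.iter().enumerate() {
        total = total.saturating_add(size);
        if total >= budget {
            return i + 1;
        }
    }
    encoded_sizes.len()
}
```

With 1024 values of 128 B plus a 4-byte length prefix each and a 64 KiB budget, the first rule stops at 496 values (65,472 bytes, just short of the threshold) while the second returns 497 (65,604 bytes), which is what lets the caller's page-flush check trip immediately.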
Force-pushed from 4769588 to bcdb878
…dSizeBinary page sizing

Adds two `LayoutTest` cases to `arrow_writer_layout.rs` that exercise byte-budget page-sizing paths introduced in apache#9972 through the real `ArrowWriter` user path:

- `test_dictionary`: an arrow `DictionaryArray<Int32, Utf8>` input, which drives the dictionary-input arm of `ByteArrayEncoder::count_values_within_byte_budget_gather` (`DataType::Dictionary(_, _) => indices.len()`). Previously uncovered.
- `test_fixed_size_binary`: a non-dictionary `FixedSizeBinary` column, which the arrow writer routes through the generic `ColumnValueEncoderImpl<FixedLenByteArrayType>`. Covers the FLBA branch of `plain_encoded_byte_size` and the variable-width scan in `count_values_within_byte_budget` via the arrow path (only the raw column-writer test covered it before).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alamb the numbers now look like they show a real regression. That wasn't the case in #9972 (comment), so I'll check whether any of the changes since then caused it.
Addresses the easy doc/comment items from alamb's review of apache#9972:

- byte_array.rs: trim the dictionary-arm comment to "values are already small and deduplicated"; demote max_view_value_len's prose to an inline comment; give count_within_budget_views / count_within_budget_offsets one-line "what it returns" doc summaries and explain that the + size_of::<u32>() is the 4-byte plain BYTE_ARRAY length prefix (not string content).
- encoder.rs: drop the parallel-English doc on plain_encoded_byte_size, moving the BYTE_ARRAY/FLBA/numeric rationale inline; fix the broken "dict_encoder.rs::push (line ~52)" reference to name KeyStorage::push.
- byte_budget_chunker.rs: move the module background onto ByteBudgetChunker; module doc now just links to it.
- column/writer/mod.rs: de-jargon the write_granular_chunk doc; reword the record-packing comment so it no longer reads like it describes a former implementation.
- drop the unresolvable "(see apache#9972 discussion)" references.

Comments only; no logic changes. fmt/clippy/rustdoc clean, tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-ups from alamb's review of apache#9972:

- write_batch_internal: add a comment marking pick_sub_batch_size as the key decision point (write whole mini-batch vs fall back to byte accounting).
- write_granular_chunk: drop the unreachable `sub_batch_size == 0` defensive branch. The chunker always sizes >= 1 level and the function is only entered when sub_batch_size < chunk_size, so `e > sub_start` always holds. Replaced with a debug_assert documenting the invariant.
- Extract a write_and_collect_pages test helper (+ CollectedPages) and rewrite the six page-size regression tests through it, so each only expresses its props, input, and assertions instead of repeating the TrackedWrite/SerializedPageWriter/reader boilerplate. ~130 fewer lines, no coverage change.

All column-writer + layout tests pass; fmt/clippy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-pushed from e79366b to 3e39ba1
Addresses the last open review thread on apache#9972 (alamb, encoder.rs:329): the contiguous and gather variants were near-identical copies of the fixed-width O(1) shortcut and the variable-width cumulative scan.

Extract the shared core into a private `count_within_budget` helper over an `Iterator<Item = Option<&T::T>>`; the contiguous path maps each value to `Some`, the gather path maps each index through `values.get` (yielding `None`, counted but zero-byte, for the defensive out-of-range case, which preserves the original `continue`-but-advance-`i` semantics exactly). The helper is defined at the end of the module next to plain_encoded_byte_size for the same code-placement reason documented there.

Benchmarked to confirm no regression from the documented placement sensitivity (string / string_and_binary_view, cargo bench arrow_writer, vs pre-change baseline): all cases within noise (string/default improved ~4%, the rest "no change"/"within noise", max +0.9% on one view case that read "no change" on a confirmation run).

Correctness: column::writer tests (100) + arrow_writer_layout (11) pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
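The refactor described above, one scan shared by the contiguous and gather paths, can be sketched roughly like this. It is a simplified model under stated assumptions: values are plain byte vectors rather than `T::T`, `plain_size` is a hypothetical stand-in for `plain_encoded_byte_size`, the fixed-width O(1) shortcut is omitted, and the wrapper names are illustrative.

```rust
/// Hypothetical stand-in for the plain-encoded size of one BYTE_ARRAY value:
/// a 4-byte length prefix plus the payload.
fn plain_size(value: &[u8]) -> usize {
    4 + value.len()
}

/// Shared core: count how many items to take until the cumulative encoded
/// size reaches `budget`. A `None` item is counted but contributes zero bytes.
fn count_within_budget<'a, I>(values: I, budget: usize) -> usize
where
    I: Iterator<Item = Option<&'a [u8]>>,
{
    let mut total = 0usize;
    let mut count = 0usize;
    for value in values {
        count += 1;
        if let Some(v) = value {
            total = total.saturating_add(plain_size(v));
        }
        if total >= budget {
            return count;
        }
    }
    count
}

/// Contiguous path: every position holds a value, so map each to `Some`.
fn count_contiguous(values: &[Vec<u8>], budget: usize) -> usize {
    count_within_budget(values.iter().map(|v| Some(v.as_slice())), budget)
}

/// Gather path: look each index up; an out-of-range index yields `None`.
fn count_gather(values: &[Vec<u8>], indices: &[usize], budget: usize) -> usize {
    count_within_budget(
        indices.iter().map(|&i| values.get(i).map(|v| v.as_slice())),
        budget,
    )
}
```

Passing an iterator of `Option` keeps the two callers down to a single `map` each, while the out-of-range case still advances the count without adding bytes.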
- closes apache#10061

The column writer only checks the data/dictionary page byte limit *after* each `write_batch_size` mini-batch, so a batch of large variable-width values piles into a single oversized page before the check fires (we've observed multi-GiB data pages and large dictionary-page overshoot at default settings).

Make the mini-batch size byte-budget aware in the generic column writer:

- `ColumnValueEncoder::count_values_within_byte_budget{,_gather}` (default `None` = "no estimate, stay batched"), with a concrete impl on `ColumnValueEncoderImpl` driven by `plain_encoded_byte_size`. Fixed-width physical types answer in one division; only variable-width BYTE_ARRAY/FLBA walk values, stopping at the first that overruns.
- `LevelDataRef::value_count` converts a chunk's level span into a leaf value count (O(1) for flat columns, def-level scan when nullable/nested).
- `ByteBudgetChunker` picks the largest sub-batch that fits one page budget. The common case (small or fixed-width values) returns the whole chunk with no value inspection, so the hot path is unchanged. During dictionary encoding it sizes against the dictionary page's remaining budget instead, since the data page then holds only small RLE indices.
- `write_batch_internal` consults the chunker per chunk and, only when a chunk would overflow, routes through `write_granular_chunk`, which sub-batches so the post-write page check fires in time. Repeated/nested columns step on record (rep == 0) boundaries so a record never spans pages.

Includes the `ColumnWriterImpl`-level regression tests (data page, list, nullable, FLBA, dictionary spill, dictionary page bound).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
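Two of the building blocks listed above, counting leaf values from definition levels and ending sub-batches only on record boundaries, might look like the sketch below. Both functions and their signatures are hypothetical simplifications for illustration; they are not the PR's `LevelDataRef::value_count` or `write_granular_chunk`.

```rust
/// Hypothetical: number of leaf values in a span of levels. A flat,
/// non-nullable column has no definition levels, so the answer is O(1);
/// otherwise only positions at the maximum definition level carry a value.
fn value_count(def_levels: Option<&[i16]>, max_def_level: i16, num_levels: usize) -> usize {
    match def_levels {
        None => num_levels,
        Some(levels) => levels.iter().filter(|&&d| d == max_def_level).count(),
    }
}

/// Hypothetical: exclusive end of a sub-batch that starts at `start`, takes
/// at least `target` levels (minimum 1), and is extended until the next
/// level begins a new record (rep == 0) or the input ends.
fn next_record_aligned_end(rep_levels: &[i16], start: usize, target: usize) -> usize {
    let mut end = (start + target.max(1)).min(rep_levels.len());
    while end < rep_levels.len() && rep_levels[end] != 0 {
        end += 1;
    }
    end
}
```

For repetition levels `[0, 1, 1, 0, 1, 0]` and a target of one level per sub-batch, the boundaries fall at 3, 5 and 6, so each sub-batch holds exactly one whole record and no record is split across a page flush.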
Implement `ColumnValueEncoder::count_values_within_byte_budget_gather` for `ByteArrayEncoder`, the encoder real `ArrowWriter` users hit, so the page-size bound from the previous commit also fires for arrow string/binary columns (the generic path only covered `ColumnValueEncoderImpl`).

The impl stays off the hot path for small values via cheap O(1) upper bounds before any per-value walk:

- Offset-backed arrays (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`): the span `offsets[last+1] - offsets[first]` bounds the chunk payload in O(1); exact even for nullable columns (skipped positions add zero), so sparse `indices` skip the per-value walk too.
- View arrays (`Utf8View`/`BinaryView`): lengths live in the low 32 bits of each view word, so an O(1) `n * (max_value_len + 4)` bound skips the scan in the common case; otherwise scan lengths with no data-buffer deref.
- Dictionary input: treated as always-fitting. Dict-encoded arrow input implies values small enough to dedup, the opposite of the blob case this targets, and a per-key walk measurably regressed the bench.

Includes the arrow-writer unit tests for granular-mode round-trip and the all-null string column.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
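The two O(1) bounds described above can be illustrated on raw buffers. This is a sketch under assumptions: offsets are a plain `&[i32]`, views are raw `u128` words, and all function names are hypothetical rather than taken from the PR. The only layout facts it relies on are the ones the commit message states: a 4-byte length prefix per plain BYTE_ARRAY value, and the value length stored in the low 32 bits of each view.

```rust
/// Plain BYTE_ARRAY encoding writes a 4-byte length prefix before each value.
const LEN_PREFIX: usize = 4;

/// Hypothetical O(1) bound for offset-backed arrays: the payload between the
/// first and last selected positions, plus one length prefix per value.
/// `n` is the number of values actually selected in that span.
fn offsets_span_bound(offsets: &[i32], first: usize, last: usize, n: usize) -> usize {
    let payload = (offsets[last + 1] - offsets[first]) as usize;
    payload + n * LEN_PREFIX
}

/// Length of a view-array value: the low 32 bits of the 128-bit view word.
fn view_len(view: u128) -> usize {
    (view as u32) as usize
}

/// Hypothetical O(1) bound for view arrays, given a known maximum length.
fn views_bound(n: usize, max_value_len: usize) -> usize {
    n * (max_value_len + LEN_PREFIX)
}

/// Fallback scan: sum the encoded sizes from the view words alone,
/// without touching any data buffer.
fn views_scan_bytes(views: &[u128]) -> usize {
    views.iter().map(|&v| view_len(v) + LEN_PREFIX).sum()
}
```

If either bound already fits the page budget, the whole chunk can be written without inspecting individual values; only when it does not is a per-value walk needed.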
Add declarative `LayoutTest` cases covering the arrow write path's page layout under the new byte budget, replacing hand-rolled page-reading loops with exact page counts/sizes:

- large `Utf8` strings and `Utf8View` strings (one page per value)
- large values inside a list column (record-by-record stepping)
- nullable large values (def-level value counting)
- dictionary spill then plain-encode transition
- FixedSizeBinary byte budget

Also updates the existing `test_string` dict-spill expectations: the dictionary page is now bounded at its limit and spills one mini-batch earlier instead of overshooting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Force-pushed from cacbd4b to c654237

FYI I've rebased and rewritten the commit history to make it easier to review / for the history books once this merges.
We write large values into our parquet files (e.g. a 5MB LLM prompt). A naive write will cause massive pages (we've seen up to 2GB) at default write settings. The main knob to control this is `write_batch_size`, which defaults to 1024. But if each row is 5MB, that's 5GB. On the other hand, setting this to something small like 32 kills write performance and is completely unnecessary for other fixed-width columns.

The writer even documents this (`parquet/src/column/writer/mod.rs`):

This PR makes the mini-batch size byte-budget aware:

- Estimate `bytes_per_value` from the values about to be written and pick `sub_batch_size = page_byte_limit / bytes_per_value` (clamped ≥ 1).
- When `sub_batch_size` ≥ chunk size, we stay on the existing batched fast path with zero behavior change.

Implementation notes

Skip the byte-size check while parquet dictionary encoding is active: `estimated_value_bytes` returns plain-encoded size but a dict-encoded data page only stores small RLE indices, so the estimate would spuriously shrink pages. Dict fallback bounds dict-encoded pages independently.

For repeated/nested columns the sub-batch steps record-by-record (rep == 0 boundaries) so a record never spans data pages, matching the parquet format rule.
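As a worked illustration of the sizing rule in this description, the arithmetic can be written out directly. The function names and the 1 MiB page limit used in the example are assumptions for illustration; only the formula `page_byte_limit / bytes_per_value`, clamped to at least 1, comes from the text.

```rust
/// Hypothetical: values per sub-batch for a given page byte limit.
/// Clamped to at least 1 so a value larger than the page still gets written.
fn sub_batch_size(page_byte_limit: usize, bytes_per_value: usize) -> usize {
    // Guard against a zero estimate (e.g. empty strings) before dividing.
    (page_byte_limit / bytes_per_value.max(1)).max(1)
}

/// Hypothetical: the existing batched path is kept whenever the byte budget
/// allows at least a full chunk of values.
fn stays_on_fast_path(sub_batch_size: usize, chunk_size: usize) -> bool {
    sub_batch_size >= chunk_size
}
```

Assuming a 1 MiB page limit, 5 MiB values give a sub-batch of 1 (one value per write, then a page check), while 4-byte values give 262,144, far above a 1024-value chunk, so fixed-width columns never leave the fast path.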
Regression test

`test_column_writer_caps_page_size_for_large_byte_array_values` writes 64 × 64 KiB BYTE_ARRAY values with a 16 KiB page byte limit. Before this fix that produced a single ~4 MiB page; after, it's one page per value (~64 pages, all within ~2× the value size).

Bench results

5-run medians, criterion `arrow_writer` bench, default writer properties, on a noisy laptop (run-to-run variance ~±1.6%):

- `primitive/default` (i32 25% null)
- `primitive_non_null/default`
- `bool_non_null/default`
- `string/default`
- `short_string_non_null/default` (new, 1M × 8 B)
- `large_string_non_null/default` (new, 1024 × 256 KiB)
- `string_non_null/default`
- `string_dictionary/default`
- `list_primitive/default`
- `list_primitive_non_null/default`

🤖 Generated with Claude Code