Data: Enable Parquet variant shredding for Record writes by soumilshah1995 · Pull Request #16370 · apache/iceberg

soumilshah1995 · 2026-05-16T21:58:32Z

Summary

Kafka Connect uses Iceberg’s generic Record model with a Void engine schema. Parquet variant shredding was ineffective on that path because the generic ParquetFormatModel did not use a variant shredding analyzer / row copier, and RecordVariantShreddingAnalyzer could not resolve VARIANT columns (engine resolveColumnIndex is a dead end for Void).

This PR wires RecordVariantShreddingAnalyzer and Record::copy into GenericFormatModels and analyzes VARIANT columns by Iceberg Schema#columns() order so buffered inference and shredded Parquet columns work for Connect.

Changes

GenericFormatModels: register ParquetFormatModel with RecordVariantShreddingAnalyzer + Record::copy.
RecordVariantShreddingAnalyzer: implement analyzeVariantColumns using positional indices aligned with Record#get.

Config (Connect)

Table write properties (e.g. via iceberg.tables.write-props):

write.parquet.shred-variants=true
write.parquet.variant-inference-buffer-size=<rows>

Test plan

./gradlew :iceberg-data:check :iceberg-kafka-connect:iceberg-kafka-connect:check (or CI green).
Connect sink writing VARIANT with write.parquet.shred-variants=true; inspect Parquet for typed_value paths / higher physical column count vs false.
Regression: Connect append with shredding disabled still succeeds.

Register ParquetFormatModel with RecordVariantShreddingAnalyzer and Record::copy, and analyze VARIANT columns using Iceberg schema column order so shredding works with Void engine schemas (Kafka Connect).

nssalian · 2026-05-19T14:59:26Z

Thanks for the contribution @soumilshah1995. Please fix the tests and get CI in a good shape.

Cover Iceberg column-order resolution and FormatModelRegistry Parquet shredding round-trip for generic Record writes (Kafka Connect path). Co-authored-by: Cursor <cursoragent@cursor.com>

Rename setup() to before() per Iceberg test naming convention. Co-authored-by: Cursor <cursoragent@cursor.com>

soumilshah1995 · 2026-05-19T19:41:09Z

@nssalian
Thank you very much for taking the time to review my changes. This is my first contribution, and I truly appreciate your patience and guidance. I have addressed the changes you mentioned. Please let me know if there is anything else you would like me to update or improve.

soumilshah199500 · 2026-05-21T14:14:40Z

Hi @laskoviymishka can you review PR Thank you

laskoviymishka · 2026-05-21T14:42:32Z

I will take a look (was waiting to CI become green :D)

nssalian

Left a few comments.
Separately, I think it would be great to update the title to be more Data: Enable Parquet variant shredding for Record writes since this is specifically in the data module.

nssalian · 2026-05-21T18:26:53Z

+ * <p>Shredding extracts frequently-occurring fields from variant data into typed Parquet columns
+ * for improved query performance while maintaining the full variant data in the raw value field.
+ */
+class RecordVariantShreddingAnalyzer extends VariantShreddingAnalyzer<Record, Void> {


A few issues to note in the doc string here:

"used by Kafka Connect and other tools" misframes this class as Connect-specific. PR title needs a change too.

The "Shredding extracts frequently-occurring fields..." paragraph describes what shredding is; that belongs on the base class VariantShreddingAnalyzer, not the Record-specific subclass. I don't think this is needed here.

The doc string doesn't actually describe what's specific about this implementation (positional indexing aligned with Record.get(int)).

Would be great to be consistent with SparkVariantShreddingAnalyzer

Thanks a lot for reviewing this. I’ll make the necessary changes.

nssalian · 2026-05-21T18:28:08Z

+   * -1}, so variant columns were never analyzed and Parquet shredding never activated for Kafka
+   * Connect and other Record-based writers.
+   */
+  @Override


This javadoc describes pre-fix state ("is unused", "always produced -1", "never activated") rather than what the method
does. Suggest dropping it - {@inheritDoc} is the default for overrides and the override's behavior needs to be explained instead

nssalian · 2026-05-21T18:30:14Z

+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
+
+public class TestRecordVariantShreddingAnalyzer {


No test covers a record with a null variant value. Could you add one?

nssalian · 2026-05-21T18:33:51Z

+   * -1}, so variant columns were never analyzed and Parquet shredding never activated for Kafka
+   * Connect and other Record-based writers.
+   */
+  @Override


This override duplicates the loop from VariantShreddingAnalyzer.analyzeVariantColumns. Spark/Flink override only the protected hooks (resolveColumnIndex, extractVariantValues), not the public template.

nssalian · 2026-05-21T18:35:45Z

+  }
+
+  @Override
+  protected int resolveColumnIndex(Void engineSchema, String columnName) {


Is this getting called?

nssalian · 2026-05-21T18:37:38Z

+  }
+
+  @Override
+  protected List<VariantValue> extractVariantValues(


instanceof Variant silently skips both nulls and non-Variant values. Should this throw on non-Variant so caller bugs surface instead of silently shrinking the analysis set? Worth referring to the Spark and Flink implementation.

Use Schema-based resolveColumnIndex instead of overriding analyzeVariantColumns, set engine schema in ParquetFormatModel for generic Record writes, reject non-Variant column values, and add null/wrong-type tests. Co-authored-by: Cursor <cursoragent@cursor.com>

laskoviymishka

Nice rework @soumilshah1995 -- the analyzer is much cleaner now, and wiring it through ParquetFormatModel.schema() feels like the right shape.

Two things before merge:

First, I don’t think the KC path is actually wired yet. RegistryBasedFileWriterFactory.newDataWriter() calls .schema(s).engineSchema(inputSchema()). In the Connect path, inputSchema() is null, so we first auto-populate engineSchema = icebergSchema, then immediately overwrite it with null.

After that, buildShreddedAppender() gets a null engine schema, resolveColumnIndex returns -1 for every column, and shredding silently does not activate for KC writes.

The current test misses this because it builds the writer directly through dataWriteBuilder(...).schema(...).build(), not through GenericFileWriterFactory.

Simplest fix is probably a null guard in WriteBuilderWrapper.engineSchema(), so the auto-populated value sticks. Or RegistryBasedFileWriterFactory can skip .engineSchema(...) when inputSchema() is null.

Can we also add a KC-shaped test through:

GenericFileWriterFactory.Builder.writerProperties("write.parquet.shred-variants", "true").build()

and assert typed_value groups show up in the output Parquet?

I’ll ping @AnatolyPopov too since he knows the kafka-connect side deeper.

Second, RecordVariantShreddingAnalyzer.resolveColumnIndex should fail loudly on null engineSchema, not return -1. Once the writer fix lands this should be unreachable, but silent fallback is exactly how this slipped through. Spark does a checkNotNull here too.

A few non-blockers:

Connect creates one TaskWriter per partition, so different partitions can infer different typed_value schemas in the same snapshot. Spec allows it, but worth a doc note.
PARQUET_VARIANT_BUFFER_SIZE_DEFAULT = 100 feels small for KC, where max.poll.records is often 500+. Worth documenting write.parquet.variant-inference-buffer-size as a knob operators may need to bump.
Two small inline nits: O(n) lookup for wide schemas, and raw-row assertion only checks row 0. Fine as follow-ups.

Fix the engineSchema overwrite, add the null check + KC-path test, and this should be good to land.

laskoviymishka · 2026-05-21T20:53:57Z

    public ModelWriteBuilder<D, S> schema(Schema newSchema) {
      this.schema = newSchema;
      internal.schema(newSchema);
+      if (Schema.class.equals(schemaType)) {


Right idea, but I think two things need to happen here for this to actually work end-to-end.

First, this auto-set fires every time schema(...) is called and silently clobbers a caller-set engineSchema if the order happens to be engineSchema(custom).schema(s). I'd gate it so explicit wins:

if (this.engineSchema == null && Schema.class.equals(schemaType)) { this.engineSchema = (S) newSchema; }

Second — and this is the bigger one — RegistryBasedFileWriterFactory.newDataWriter() calls .schema(s).engineSchema(inputSchema()) back-to-back, and for the KC path GenericFileWriterFactory.Builder.build() goes through the deprecated 10-arg constructor and passes null for inputSchema. So this auto-populate does the right thing, then .engineSchema(null) immediately overwrites it because WriteBuilderWrapper.engineSchema(...) (just below) has no null guard. By the time buildShreddedAppender() runs, resolveColumnIndex gets null, returns -1 for every column, and shredding silently doesn't activate for KC writes. A null guard on the engineSchema(...) setter — skip the assignment if newSchema == null — would make the auto-populated value stick.

Also worth a @SuppressWarnings("unchecked") on the cast (or pulling it into a small helper) — it currently emits an unchecked-cast warning.

I'll ping @AnatolyPopov to take a look too since he's closer to the KC side and can sanity-check the wiring claim.

laskoviymishka · 2026-05-21T20:53:58Z

+  @Override
+  protected int resolveColumnIndex(Schema engineSchema, String columnName) {
+    if (engineSchema == null) {
+      return -1;


Returning -1 on a null engineSchema silently disables shredding for the whole column. Once the writer-side fix is in (see my comment in ParquetFormatModel.java), this should be unreachable in normal paths, so I'd rather it fail loudly than degrade silently:

Preconditions.checkNotNull(engineSchema, "Invalid engine schema: null");

Spark's analyzer does the same — calls sparkSchema.fieldIndex(name) with no null handling.

laskoviymishka · 2026-05-21T20:53:58Z

+    }
+
+    List<NestedField> cols = engineSchema.columns();
+    for (int i = 0; i < cols.size(); i++) {


Non-blocker, but worth thinking about: this is O(n) per variant column, so O(n·k) for wide schemas (CDC tables with 200+ columns are not uncommon). Spark uses sparkSchema.fieldIndex(name) which is O(1). A precomputed Map<String, Integer> cached on the instance would match that.

Also, while we're here — the class Javadoc says "positional indices aligned with Schema#columns()", and extractVariantValues then uses record.get(int) against that same index. That's a real contract: the records being analyzed must have been built against the same schema passed as engineSchema. Worth one sentence in the Javadoc making that explicit so a future caller doesn't pass a projected schema and get silent misalignment.

laskoviymishka · 2026-05-21T20:53:58Z

+      assertThat(variantData.getFieldRepetitionCount("value")).isEqualTo(0);
+
+      Group typedValue = variantData.getGroup("typed_value", 0);
+      assertThat(typedValue.getGroup("a", 0).getInteger("typed_value", 0)).isEqualTo(42);


Two things on this round-trip test.

The high-level read-back via InternalTestHelpers.assertEquals covers all three records, which is great. The raw Parquet read here only checks row 0 though, and with BUFFER_SIZE=2 and 3 input rows the second buffer flush (rows 2→3 boundary) is exactly what we'd want physical evidence for. A short while ((row = rawReader.read()) != null) loop asserting each row's typed_value would close that. Fine as a follow-up.

More importantly: there's no test that goes through RegistryBasedFileWriterFactory (the actual KC production path). This test calls dataWriteBuilder(...).schema(...).build() directly and never touches .engineSchema(...), so it exercises exactly the path the PR's schema() fix targets but not the KC shape (.schema(s).engineSchema(null)) — which is why the engineSchema overwrite I flagged in ParquetFormatModel.java slipped through. A test that builds via GenericFileWriterFactory.Builder.writerProperties("write.parquet.shred-variants", "true").build() and asserts typed_value groups appear in the output would catch that regression.

Preserve auto-populated engine schema when callers pass null, cache column index lookups, and add KC factory plus multi-row shredding round-trip tests. Co-authored-by: Cursor <cursoragent@cursor.com>

Reconcile ParquetFormatModel with upstream copyFuncFactory API while keeping engineSchema auto-population for generic Record writes. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot added the data label May 16, 2026

soumilshah1995 force-pushed the kafka-connect-shredding branch 2 times, most recently from 8ce1a9d to ccb48f1 Compare May 16, 2026 22:00

Kafka Connect: Enable Parquet variant shredding for generic Record

67bfc1a

Register ParquetFormatModel with RecordVariantShreddingAnalyzer and Record::copy, and analyze VARIANT columns using Iceberg schema column order so shredding works with Void engine schemas (Kafka Connect).

soumilshah1995 force-pushed the kafka-connect-shredding branch from ccb48f1 to 67bfc1a Compare May 16, 2026 22:05

This was referenced May 17, 2026

Kafka Connect: Enable Parquet variant shredding for generic Record writes#16370 #16387

Open

Kafka Connect: Support VARIANT when record convert #15283

Merged

soumilshah199500 and others added 2 commits May 19, 2026 14:26

Data: Add tests for Record variant shredding analyzer

ce90bc0

Cover Iceberg column-order resolution and FormatModelRegistry Parquet shredding round-trip for generic Record writes (Kafka Connect path). Co-authored-by: Cursor <cursoragent@cursor.com>

Data: Fix checkstyle in Record variant shredding test

03ff0d2

Rename setup() to before() per Iceberg test naming convention. Co-authored-by: Cursor <cursoragent@cursor.com>

laskoviymishka self-requested a review May 21, 2026 14:42

nssalian suggested changes May 21, 2026

View reviewed changes

soumilshah1995 changed the title ~~Kafka Connect: Enable Parquet variant shredding for generic Record writes~~ Data: Enable Parquet variant shredding for Record writes May 21, 2026

github-actions Bot added the parquet label May 21, 2026

laskoviymishka requested changes May 21, 2026

View reviewed changes

soumilshah199500 and others added 2 commits May 21, 2026 17:26

Data: Harden Record variant shredding for KC write path

ef24bfa

Preserve auto-populated engine schema when callers pass null, cache column index lookups, and add KC factory plus multi-row shredding round-trip tests. Co-authored-by: Cursor <cursoragent@cursor.com>

Merge upstream/main and fix spotless/conflicts

4f6cd1f

Reconcile ParquetFormatModel with upstream copyFuncFactory API while keeping engineSchema auto-population for generic Record writes. Co-authored-by: Cursor <cursoragent@cursor.com>

Conversation

soumilshah1995 commented May 16, 2026

Summary

Changes

Config (Connect)

Test plan

Uh oh!

nssalian commented May 19, 2026

Uh oh!

soumilshah1995 commented May 19, 2026

Uh oh!

soumilshah199500 commented May 21, 2026

Uh oh!

laskoviymishka commented May 21, 2026

Uh oh!

nssalian left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

laskoviymishka left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

laskoviymishka left a comment •

edited

Loading