
iceberg: set MAP logical type on parquet map columns #30454

Merged

wdberkeley merged 2 commits into redpanda-data:dev from nvartolomei:nv/iceberg-map-parquet-logical-type on May 13, 2026

Conversation

@nvartolomei
Contributor

@nvartolomei nvartolomei commented May 13, 2026

Iceberg's parquet reader uses the LogicalType.MAP annotation on the
map's root element to disambiguate it from a plain repeated group.
Without the annotation, strict readers — concretely Spark via
SparkParquetReaders' TypeWithSchemaVisitor.visitField — throw
IllegalArgumentException: Not a struct type: map<...> when reading
the column. `describe` keeps reporting the column as a map because the
declared type lives in the Iceberg metadata and is unaffected by the
missing Parquet annotation; only column read-back was broken. Trino's reader
happened to be lenient enough to accept the unannotated layout, which
is why this slipped past the existing test_avro_schema map case
(it only validates describe output, not values).

The fix mirrors the existing LIST converter and stamps the
LogicalType.MAP annotation on the map root element. Adds a unit-test
assertion that the annotation is present (the existing Maps test
validated shape but not the annotation, which is why this regressed
silently), plus an end-to-end ducktape test that produces a map and
reads it back via both Spark and Trino.
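
For illustration, one way to check whether a written data file actually carries the annotation is to dump the raw Parquet schema with pyarrow. This is a hedged sketch; the file path and column name below are hypothetical, not taken from this PR.

```python
import pyarrow.parquet as pq

# Hypothetical path to a data file written for the Iceberg table.
path = "data/00000-0-example.parquet"

# ParquetFile.schema exposes the raw Parquet schema tree, including
# logical-type annotations such as (Map) on group nodes.
schema = pq.ParquetFile(path).schema
print(schema)

# A correctly annotated map column prints roughly like:
#
#   optional group my_map (Map) {
#     repeated group key_value {
#       required binary key (String);
#       optional int64 value;
#     }
#   }
#
# Before this fix the (Map) annotation was absent, so strict readers saw
# only a plain repeated group and refused to treat it as a map.
assert "(Map)" in str(schema)
```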

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Bug Fixes

  • Fix Iceberg map columns being unreadable from strict Parquet readers
    (e.g. Apache Spark) due to a missing LogicalType.MAP annotation in
    the written Parquet schema.

Without the LogicalType.MAP annotation, strict Parquet readers (Spark
via SparkParquetReaders) fail with "Not a struct type" when reading
the column, even though `describe` reports the column as a map. The
declared type lives in the Iceberg metadata and is unaffected by the
missing annotation; only column read-back was broken.

Mirror the LIST converter and stamp the annotation on the map root
element. Cover with a unit-test assertion that the annotation is
present and an end-to-end ducktape test that exercises map read-back
via both Spark and Trino.
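
As a rough sketch of the Spark-side behaviour described above (not the ducktape code from this PR), assuming a Spark session already configured with an Iceberg catalog, and with placeholder catalog, table, and column names:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime is on the classpath and a catalog
# named "ice" is already configured; "ice.db.tbl" and "vals" are
# placeholder names, not taken from this PR.
spark = SparkSession.builder.appName("map-readback").getOrCreate()

# DESCRIBE only consults Iceberg metadata, so it reports the map type
# even when the Parquet files lack the MAP annotation.
spark.sql("DESCRIBE TABLE ice.db.tbl").show()

# Selecting the column goes through Spark's strict Parquet read path;
# without the annotation this is where the
# "IllegalArgumentException: Not a struct type: map<...>" surfaced.
rows = spark.sql("SELECT vals FROM ice.db.tbl").collect()
print(rows[:5])
```
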
Copilot AI review requested due to automatic review settings May 13, 2026 08:59
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes interoperability for Iceberg Parquet map columns by ensuring the Parquet schema includes the required LogicalType.MAP annotation on map root elements, which strict readers (notably Spark) use to distinguish maps from plain repeated groups.

Changes:

  • Annotate Parquet map root schema elements with serde::parquet::map_type during Iceberg→Parquet schema conversion.
  • Strengthen the Parquet schema unit test to assert the map logical type is present.
  • Add an end-to-end datalake test that writes Avro map<string,long> values and validates read-back via both Spark and Trino.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

  • src/v/iceberg/conversion/schema_parquet.cc: Sets LogicalType.MAP on converted Parquet map root elements.
  • src/v/iceberg/conversion/tests/iceberg_parquet_tests.cc: Adds an assertion that map fields carry the map logical type annotation.
  • tests/rptest/tests/datalake/datalake_e2e_test.py: Adds an e2e round-trip test validating map value readability via Spark and Trino.
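
For comparison, a minimal sketch of the Trino side of that round trip, using the trino Python client; the connection details and table/column names are placeholders, not values from the test in this PR.

```python
import trino

# Host, port, user, catalog, schema, table, and column names are all
# placeholders.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="test",
    catalog="iceberg",
    schema="db",
)
cur = conn.cursor()

# Trino's Parquet reader happened to accept the unannotated layout, so
# this query succeeded even before the fix; the strictness gap is why
# the e2e test reads the map back through both engines.
cur.execute("SELECT vals FROM tbl LIMIT 5")
print(cur.fetchall())
```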

Trino's parquet reader is permissive enough to accept a map column
written without the LogicalType.MAP annotation, so the engine-side
assertion alone misses regressions that only show up in stricter
readers. Add a pyiceberg read-back step that scans the column via
Arrow; pyiceberg rejects the malformed file (a different failure mode
than Spark's), which gives us coverage independent of the query engine.
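
A minimal sketch of that kind of read-back step, assuming a REST catalog; the catalog URI, table identifier, and column name are placeholders rather than the values used in the ducktape test.

```python
from pyiceberg.catalog import load_catalog

# Catalog name, URI, and table identifier are hypothetical.
catalog = load_catalog("rest", uri="http://localhost:8181")
table = catalog.load_table("db.tbl")

# Scanning to Arrow goes through pyiceberg's own Parquet handling, so a
# data file whose map group lacks the MAP annotation is rejected here
# independently of how Spark or Trino behave.
arrow_table = table.scan(selected_fields=("vals",)).to_arrow()
print(arrow_table.to_pydict())
```
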
@nvartolomei nvartolomei requested review from andrwng and wdberkeley May 13, 2026 09:56
@wdberkeley
Contributor

Oof

Contributor

@andrwng andrwng left a comment


Yikes, great find.

@wdberkeley wdberkeley merged commit feb53a1 into redpanda-data:dev May 13, 2026
22 checks passed
@vbotbuildovich
Collaborator

/backport v26.1.x

@vbotbuildovich
Collaborator

/backport v25.3.x

@vbotbuildovich
Collaborator

/backport v25.2.x
