iceberg: set MAP logical type on parquet map columns #30454
Merged
wdberkeley merged 2 commits on May 13, 2026
Conversation
Without the LogicalType.MAP annotation, strict parquet readers (Spark via SparkParquetReaders) fail with "Not a struct type" when reading the column, even though `describe` reports the column as a map. The declared type lives in the iceberg metadata and is unaffected by the missing annotation; only column read-back was broken. Mirror the LIST converter and stamp the annotation on the map root element. Cover with a unit-test assertion that the annotation is present and an end-to-end ducktape test that exercises map read-back via both Spark and Trino.
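A minimal sketch of the idea behind the fix (hypothetical Python, not the actual C++ in `schema_parquet.cc`): during Iceberg→Parquet schema conversion, the map's root group element gets a `MAP` logical-type annotation stamped on it. The field names `key_value`, `key`, and `value` follow the Parquet format spec's three-level map layout; everything else here is illustrative.

```python
def convert_map_type(name, key_element, value_element):
    """Build a Parquet-style schema element for a map column.

    Without logical_type="MAP", the three-level group below is
    indistinguishable from a plain repeated group of structs, which
    is exactly what strict readers such as Spark reject.
    """
    return {
        "name": name,
        "logical_type": "MAP",  # the annotation this PR adds
        "children": [
            {
                "name": "key_value",
                "repetition": "repeated",
                "children": [key_element, value_element],
            }
        ],
    }

elem = convert_map_type(
    "attrs",
    {"name": "key", "type": "string", "repetition": "required"},
    {"name": "value", "type": "long", "repetition": "optional"},
)
print(elem["logical_type"])  # MAP — the property the unit test now asserts
```

The unit-test strengthening amounts to checking that converted map elements carry this annotation, not just the right group shape.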
Contributor
Pull request overview
This PR fixes interoperability for Iceberg Parquet map columns by ensuring the Parquet schema includes the required LogicalType.MAP annotation on map root elements, which strict readers (notably Spark) use to distinguish maps from plain repeated groups.
Changes:
- Annotate Parquet map root schema elements with `serde::parquet::map_type` during Iceberg→Parquet schema conversion.
- Strengthen the Parquet schema unit test to assert the map logical type is present.
- Add an end-to-end datalake test that writes Avro `map<string,long>` values and validates read-back via both Spark and Trino.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/v/iceberg/conversion/schema_parquet.cc | Sets LogicalType.MAP on converted Parquet map root elements. |
| src/v/iceberg/conversion/tests/iceberg_parquet_tests.cc | Adds an assertion that map fields carry the map logical type annotation. |
| tests/rptest/tests/datalake/datalake_e2e_test.py | Adds an e2e round-trip test validating map value readability via Spark and Trino. |
Trino's parquet reader is permissive enough to accept a map column written without the LogicalType.MAP annotation, so the engine-side assertion alone misses regressions that only show up in stricter readers. Add a pyiceberg read-back step that scans the column via Arrow; pyiceberg rejects the malformed file (a different failure mode than Spark's), giving us coverage independent of the query engine.
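The strict-vs-lenient split can be sketched as follows (hypothetical helper names, not Trino or pyiceberg code): a strict reader refuses a map column whose root element lacks the MAP annotation, while a lenient reader infers map-ness from the repeated key/value shape alone, so it never notices the regression.

```python
def strict_read_map(element):
    # Analogous to Spark/pyiceberg behavior: no annotation, no map.
    if element.get("logical_type") != "MAP":
        raise ValueError(f"Not a map type: {element['name']}")
    return "map"

def lenient_read_map(element):
    # Analogous to Trino's permissiveness: shape alone is enough.
    kv = element["children"][0]
    if kv["repetition"] == "repeated" and len(kv["children"]) == 2:
        return "map"
    raise ValueError(f"Not a map type: {element['name']}")

# A map root element missing the LogicalType.MAP annotation.
unannotated = {
    "name": "attrs",
    "children": [{
        "name": "key_value",
        "repetition": "repeated",
        "children": [{"name": "key"}, {"name": "value"}],
    }],
}

print(lenient_read_map(unannotated))  # "map": the regression goes unnoticed
try:
    strict_read_map(unannotated)
except ValueError as e:
    print(e)  # the failure the strict read-back step surfaces
```

This is why the test asserts values through more than one reader: only the strict one fails on the unannotated layout.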
wdberkeley
approved these changes
May 13, 2026
Contributor
Oof
Collaborator
/backport v26.1.x
Collaborator
/backport v25.3.x
Collaborator
/backport v25.2.x
Iceberg's parquet reader uses the `LogicalType.MAP` annotation on the map's root element to disambiguate it from a plain repeated group.

Without the annotation, strict readers (concretely Spark via `SparkParquetReaders` → `TypeWithSchemaVisitor.visitField`) throw `IllegalArgumentException: Not a struct type: map<...>` when reading the column. `describe` keeps reporting the column as a map because the type lives in the iceberg metadata and is unaffected by the missing parquet annotation; only column read-back was broken. Trino's reader happened to be lenient enough to accept the unannotated layout, which is why this slipped past the existing `test_avro_schema` map case (it only validates `describe` output, not values).

The fix mirrors the existing LIST converter and stamps the `LogicalType.MAP` annotation on the map root element. Adds a unit-test assertion that the annotation is present (the existing `Maps` test validated shape but not the annotation, which is why this regressed silently), plus an end-to-end ducktape test that produces a map and reads it back via both Spark and Trino.
Backports Required
Release Notes
Bug Fixes
- Fixed Iceberg Parquet map columns being unreadable by strict Parquet readers (e.g. Apache Spark) due to a missing `LogicalType.MAP` annotation in the written Parquet schema.