Core: Fix metrics mode lookup for escaped column names#16417
Open
wombatu-kun wants to merge 1 commit into
Open
Core: Fix metrics mode lookup for escaped column names#16417wombatu-kun wants to merge 1 commit into
wombatu-kun wants to merge 1 commit into
Conversation
MetricsConfig keys per-column metrics modes by the original Iceberg column name, but MetricsUtil.metricsMode resolves a field id to a name using whatever schema it is given. When metrics are computed from a Parquet footer (the add_files / migrate / snapshot path via ParquetUtil.fileMetrics), the schema is reconstructed from the file and carries sanitized names (AvroSchemaUtil.makeCompatibleName, e.g. $event_time -> _x24event_time). The lookup then misses and silently falls back to the default mode, which is none once a table exceeds write.metadata.metrics.max-inferred-column-defaults, so column bounds and value counts are dropped for escaped columns. Explicit per-column overrides on escaped names are ignored on the same path. This registers, alongside the original-name column modes, a separate alias map keyed by the sanitized form of each column name (sanitizing each dotted path component). columnMode now falls back to this alias map, so lookups resolve whether the caller passes the original name (write path) or the sanitized name (footer path). The alias map is kept separate from columnModes so validateReferencedColumns continues to validate only user-supplied original names against the table schema. Adds regression tests in TestMetricsModes (inferred default and explicit override for an escaped column, queried by both names) and an end-to-end TestParquet test exercising ParquetUtil.fileMetrics on a file with an escaped column. Closes apache#11950 Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
9c25f72 to
34a0bdf
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Closes #11950
MetricsConfigkeys per-column metrics modes by the original Iceberg column name, whileMetricsUtil.metricsMode(schema, config, fieldId)resolves the field id to a name using whatever schema it is handed. When metrics are computed from a Parquet footer — theadd_files/migrate/snapshotpath viaParquetUtil.fileMetrics(TableMigrationUtil, the Delta→Iceberg snapshot action) — the schema is reconstructed from the file and carries sanitized names (AvroSchemaUtil.makeCompatibleName, e.g.$event_time→_x24event_time). The name lookup then misses and silently falls back to the default mode, which isnoneonce a table exceedswrite.metadata.metrics.max-inferred-column-defaults(default 100). Column bounds and value counts are silently dropped for escaped columns, degrading scan pruning. Explicitwrite.metadata.metrics.column.<escaped-name>overrides are ignored on the same path.The write path was already correct because
ParquetWriter.metrics()passes the writer's original Iceberg schema; only the footer-reconstructed path was affected.Fix
Alongside the original-name column modes,
MetricsConfignow derives a separate alias map keyed by the sanitized form of each column name (sanitizing each dotted path component).columnModefalls back to this alias map, so a mode resolves whether the caller passes the original name (write path) or the sanitized name (footer path). The alias map is intentionally kept separate fromcolumnModessovalidateReferencedColumnskeeps validating only user-supplied original names against the table schema. The change is confined to the private constructor, so every factory path benefits and there is no public API change (revapi clean).Tests
TestMetricsModes: inferred-default and explicit-override cases for an escaped column, assertingcolumnModeresolves by both the original and the sanitized name.TestParquet: end-to-end regression that writes a Parquet file with an escaped column and verifiesParquetUtil.fileMetricscollects value counts and bounds for it.