Skip to content

Core: Fix metrics mode lookup for escaped column names#16417

Open
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:issue/11950-escaped-column-metrics
Open

Core: Fix metrics mode lookup for escaped column names#16417
wombatu-kun wants to merge 1 commit into
apache:mainfrom
wombatu-kun:issue/11950-escaped-column-metrics

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

@wombatu-kun wombatu-kun commented May 19, 2026

Problem

Closes #11950

MetricsConfig keys per-column metrics modes by the original Iceberg column name, while MetricsUtil.metricsMode(schema, config, fieldId) resolves the field id to a name using whatever schema it is handed. When metrics are computed from a Parquet footer — the add_files / migrate / snapshot path via ParquetUtil.fileMetrics (TableMigrationUtil, the Delta→Iceberg snapshot action) — the schema is reconstructed from the file and carries sanitized names (AvroSchemaUtil.makeCompatibleName, e.g. $event_time_x24event_time). The name lookup then misses and silently falls back to the default mode, which is none once a table exceeds write.metadata.metrics.max-inferred-column-defaults (default 100). Column bounds and value counts are silently dropped for escaped columns, degrading scan pruning. Explicit write.metadata.metrics.column.<escaped-name> overrides are ignored on the same path.

The write path was already correct because ParquetWriter.metrics() passes the writer's original Iceberg schema; only the footer-reconstructed path was affected.

Fix

Alongside the original-name column modes, MetricsConfig now derives a separate alias map keyed by the sanitized form of each column name (sanitizing each dotted path component). columnMode falls back to this alias map, so a mode resolves whether the caller passes the original name (write path) or the sanitized name (footer path). The alias map is intentionally kept separate from columnModes so validateReferencedColumns keeps validating only user-supplied original names against the table schema. The change is confined to the private constructor, so every factory path benefits and there is no public API change (revapi clean).

Tests

  • TestMetricsModes: inferred-default and explicit-override cases for an escaped column, asserting columnMode resolves by both the original and the sanitized name.
  • TestParquet: end-to-end regression that writes a Parquet file with an escaped column and verifies ParquetUtil.fileMetrics collects value counts and bounds for it.

MetricsConfig keys per-column metrics modes by the original Iceberg column name, but MetricsUtil.metricsMode resolves a field id to a name using whatever schema it is given. When metrics are computed from a Parquet footer (the add_files / migrate / snapshot path via ParquetUtil.fileMetrics), the schema is reconstructed from the file and carries sanitized names (AvroSchemaUtil.makeCompatibleName, e.g. $event_time -> _x24event_time). The lookup then misses and silently falls back to the default mode, which is none once a table exceeds write.metadata.metrics.max-inferred-column-defaults, so column bounds and value counts are dropped for escaped columns. Explicit per-column overrides on escaped names are ignored on the same path.

This registers, alongside the original-name column modes, a separate alias map keyed by the sanitized form of each column name (sanitizing each dotted path component). columnMode now falls back to this alias map, so lookups resolve whether the caller passes the original name (write path) or the sanitized name (footer path). The alias map is kept separate from columnModes so validateReferencedColumns continues to validate only user-supplied original names against the table schema.

Adds regression tests in TestMetricsModes (inferred default and explicit override for an escaped column, queried by both names) and an end-to-end TestParquet test exercising ParquetUtil.fileMetrics on a file with an escaped column.

Closes apache#11950

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@wombatu-kun wombatu-kun force-pushed the issue/11950-escaped-column-metrics branch from 9c25f72 to 34a0bdf Compare May 19, 2026 09:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Column statistics are not collected when the column name is escaped

1 participant