fix: sanitize Avro field names on write, respect iceberg-field-name on read by SreeramGarlapati · Pull Request #2540 · apache/iceberg-rust

SreeramGarlapati · 2026-05-30T03:40:03Z

Summary

Iceberg field names can be anything (123column, field.with.dots, etc.) but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles this with a sanitize-on-write + restore-on-read protocol using the iceberg-field-name custom Avro property. iceberg-rust was doing neither — writing invalid names directly and ignoring the property on read.

This causes two interop failures:

- Write path: iceberg-rust produces invalid Avro when field names have leading digits or special chars → strict parsers (Java, Python) reject the file
- Read path: Avro files written by Java with sanitized names get the wrong field names when read by iceberg-rust

Changes

Write path (Iceberg→Avro conversion in SchemaToAvroSchema::field):

- Added is_valid_avro_name(): checks [A-Za-z_][A-Za-z0-9_]*
- Added sanitize_avro_name(): matches Java's AvroSchemaUtil.sanitize():
  - Leading digit: prefix _ (e.g., 123col → _123col)
  - Special chars: _x<HEX> (e.g., field.name → field_x2Ename)
  - Operates on UTF-16 code units to match Java's charAt() behavior for supplementary chars
- When sanitization is needed: stores original name in iceberg-field-name property
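
The write-path rules above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: the function bodies are assumptions reconstructed from the description, and only ASCII letters, digits, and underscore are treated as valid, following the [A-Za-z_][A-Za-z0-9_]* rule quoted in the summary.

```rust
/// Returns true if `name` matches [A-Za-z_][A-Za-z0-9_]*.
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        // Empty names and names with an invalid first character are rejected.
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Sketch of the sanitization described above (assumed, not the PR's exact code).
/// Iterates over UTF-16 code units so that a supplementary character is
/// escaped as two surrogate halves, mirroring a per-`char` loop in Java.
fn sanitize_avro_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for (i, unit) in name.encode_utf16().enumerate() {
        let is_letter = matches!(unit, 0x41..=0x5A | 0x61..=0x7A);
        let is_digit = matches!(unit, 0x30..=0x39);
        let is_underscore = unit == 0x5F;
        // Digits are allowed everywhere except the first position.
        let allowed = is_letter || is_underscore || (i > 0 && is_digit);
        if allowed {
            out.push(char::from(unit as u8));
        } else if is_digit {
            // Leading digit: prefix with an underscore.
            out.push('_');
            out.push(char::from(unit as u8));
        } else {
            // Anything else: `_x` followed by the uppercase hex code unit.
            out.push_str(&format!("_x{:X}", unit));
        }
    }
    out
}
```

With these definitions, `sanitize_avro_name("123col")` yields `_123col` and `sanitize_avro_name("field.name")` yields `field_x2Ename`, matching the examples in the list above. A writer would call `is_valid_avro_name` first and only sanitize (and attach the iceberg-field-name property) when it returns false.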

Read path (Avro→Iceberg conversion in AvroSchemaToSchema::record):

Checks iceberg-field-name custom attribute first, falls back to Avro field name
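
As a rough sketch of that lookup order: `AvroField` below is a hypothetical stand-in for an Avro record field (a name plus a map of custom attributes), not the actual avro-rs type, and `iceberg_field_name` is an illustrative helper name.

```rust
use std::collections::BTreeMap;

/// Property key that carries the original Iceberg field name.
const ICEBERG_FIELD_NAME_PROP: &str = "iceberg-field-name";

/// Hypothetical, simplified model of an Avro record field.
struct AvroField {
    name: String,
    custom_attributes: BTreeMap<String, String>,
}

/// Prefer the original name stored in the custom attribute;
/// fall back to the (possibly sanitized) Avro field name.
fn iceberg_field_name(field: &AvroField) -> &str {
    field
        .custom_attributes
        .get(ICEBERG_FIELD_NAME_PROP)
        .map(String::as_str)
        .unwrap_or(field.name.as_str())
}
```

A field named `field_x2Ename` that carries `iceberg-field-name = "field.name"` resolves to `field.name`, while a field without the attribute keeps its Avro name unchanged.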

Java reference

AvroSchemaUtil.sanitize()
TypeToSchema.struct() — stores ICEBERG_FIELD_NAME_PROP
AvroSchemaUtil.ICEBERG_FIELD_NAME_PROP

Closes #2535

Test plan

test_is_valid_avro_name — validates detection of invalid names
test_sanitize_avro_name — ASCII edge cases match Java behavior
test_sanitize_avro_name_unicode — BMP and supplementary char handling (surrogate pairs)
test_sanitization_round_trip — Iceberg→Avro→Iceberg preserves original names
test_avro_to_iceberg_uses_iceberg_field_name_property — reads Java-written schemas correctly
All 1301 existing tests pass (no regressions in bidirectional schema conversion tests)

…ld-name

Iceberg field names can be arbitrary (leading digits, dots, spaces, etc.) but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles this by sanitizing invalid names on write and storing the original in an "iceberg-field-name" custom property, then checking that property on read.

iceberg-rust was writing unsanitized names directly, which causes Avro validation failures (or produces files unreadable by strict Avro parsers) when field names don't conform to Avro's naming rules.

This adds:
- sanitize_avro_name(): matches Java's AvroSchemaUtil.sanitize() logic (prefix _ for leading digits, _x<HEX> for special chars)
- Write path: sanitizes the field name and stores the original in iceberg-field-name when sanitization was needed
- Read path: checks iceberg-field-name property first, falls back to the Avro field name

Closes apache#2535

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>

…tion

Adds test cases for:
- Non-ASCII BMP characters (U+00E9, U+4E2D)
- Supplementary characters (surrogate pair handling, matching Java's UTF-16)
- Empty string edge case
- Read-path: iceberg-field-name property resolution from Java-written schemas
- Verify iceberg-field-name property is set on dotted field names

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>

xanderbailey

Nice PR, thanks for the contribution, left a couple of questions

xanderbailey · 2026-06-01T11:26:20Z

 // This const may better to maintain in avro-rs.
 const LOGICAL_TYPE: &str = "logicalType";

+fn is_valid_avro_name(name: &str) -> bool {


It would be nice if we could use something from https://github.com/apache/avro-rs/blob/4edb1ce1ae1ab5bd3fafb08ca3f622946c01c9fd/avro/src/validator.rs#L4, but it looks like validate_record_field_name is pub(crate). Is it worth filing an issue upstream to see if they could expose something that would allow us to validate against their implementation?

xanderbailey · 2026-06-01T11:41:24Z

+fn is_ascii_digit_u16(c: u16) -> bool {
+    matches!(c, 0x30..=0x39)
+}


/**
 * Determines if the specified character is a digit.
 *
 * A character is a digit if its general category type, provided
 * by {@code Character.getType(ch)}, is
 * {@code DECIMAL_DIGIT_NUMBER}.
 *
 * Some Unicode character ranges that contain digits:
 * <ul>
 * <li>{@code '\u005Cu0030'} through {@code '\u005Cu0039'},
 *     ISO-LATIN-1 digits ({@code '0'} through {@code '9'})
 * <li>{@code '\u005Cu0660'} through {@code '\u005Cu0669'},
 *     Arabic-Indic digits
 * <li>{@code '\u005Cu06F0'} through {@code '\u005Cu06F9'},
 *     Extended Arabic-Indic digits
 * <li>{@code '\u005Cu0966'} through {@code '\u005Cu096F'},
 *     Devanagari digits
 * <li>{@code '\u005CuFF10'} through {@code '\u005CuFF19'},
 *     Fullwidth digits
 * </ul>
 *
 * Many other character ranges contain digits as well.
 *
 * Note: This method cannot handle <a
 * href="#supplementary"> supplementary characters</a>. To support
 * all Unicode characters, including supplementary characters, use
 * the {@link #isDigit(int)} method.
 *
 * @param   ch   the character to be tested.
 * @return  {@code true} if the character is a digit;
 *          {@code false} otherwise.
 * @see     Character#digit(char, int)
 * @see     Character#forDigit(int, int)
 * @see     Character#getType(char)
 */
public static boolean isDigit(char ch) {
    return isDigit((int)ch);
}

https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/avro/AvroSchemaUtil.java#L551
Java's isDigit() actually covers more than just ASCII digits. This is a pretty niche edge case.

Reasoning through this, I don't think the sanitized representations need to be identical for interop, because both implementations will underscore-escape the field and restore the original name from the stored property. Can you check my reasoning here?
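
One way to make that reasoning concrete is the sketch below. It assumes both writers attach the iceberg-field-name property whenever they sanitize; the two escaped names are illustrative examples of differing outputs for a name that starts with an Arabic-Indic digit (U+0663), and `restore_name` and `props_for` are hypothetical helpers, not functions from either codebase.

```rust
use std::collections::HashMap;

const ICEBERG_FIELD_NAME_PROP: &str = "iceberg-field-name";

/// Hypothetical read-side helper: prefer the stored original name,
/// otherwise use the Avro field name as written.
fn restore_name<'a>(avro_name: &'a str, props: &'a HashMap<String, String>) -> &'a str {
    props
        .get(ICEBERG_FIELD_NAME_PROP)
        .map(String::as_str)
        .unwrap_or(avro_name)
}

/// Properties a writer is assumed to attach after sanitizing `original`.
fn props_for(original: &str) -> HashMap<String, String> {
    HashMap::from([(ICEBERG_FIELD_NAME_PROP.to_string(), original.to_string())])
}
```

Given the original name `"\u{0663}col"`, a writer that hex-escapes the leading character (for example `_x663col`) and a writer that only prefixes an underscore (for example `_\u{0663}col`) both resolve to the same original name through `restore_name`, because the lookup never inspects the escaped form. This only covers name restoration; it says nothing about whether each escaped name is itself accepted by a strict Avro parser.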

SreeramGarlapati and others added 2 commits May 29, 2026 21:31

SreeramGarlapati force-pushed the fix/avro-field-name-sanitization branch from fe8c3ff to 12ba3b5 Compare May 30, 2026 04:33

SreeramGarlapati mentioned this pull request May 30, 2026

fix: respect iceberg-field-name Avro property on read path #2539

Closed

4 tasks

xanderbailey reviewed Jun 1, 2026

SreeramGarlapati wants to merge 2 commits into apache:main from SreeramGarlapati:fix/avro-field-name-sanitization
